DESCRIPTOR VECTORS CHARACTERIZING MOLECULAR REGIONS

Inventors: Michael C. Pitman 

Daniel E. Platt

CROSS-REFERENCE TO RELATED APPLICATIONS 

The present application claims priority to Provisional U.S. Patent Application No. 06/079,196, 
and is related to U.S. Patent Applications (Attorney Docket Y0999-149) and (Attorney Docket No. 
YO999-150), herein incorporated by reference in their entirety. 

BACKGROUND OF THE INVENTION 

Technical Field 

The invention relates to the field of molecular similarity searching, and, more 
specifically, similarity searching in databases of three dimensional molecular structures. 

Description of the Related Art

In the field of drug design, where one is attempting to expand the number of lead 
compounds that show activity toward a particular therapeutic target, structural information about 
the target is often lacking or unavailable. Similarity searching in files of chemical compounds is 
a common way to uncover new leads in such situations. Typically, one or more compounds that 
are known to be active toward the target of interest are selected, and a feature scheme is defined 
that characterizes the molecular properties of interest. Features are derived from the selected
structures and used to search against a database of structures that have been keyed under the 
same feature scheme. 

Feature schemes may be structural (three-dimensional) in nature or topological (derived
solely from the molecular graph). Features that are three-dimensional (3D) characterize a whole
or part of a particular conformation of a molecule, and thus are dependent on the particular
conformations of the molecules stored in the database. 3D features can include a)
pharmacophoric descriptors, such as distances, angles, or dihedral angular relationships between
key groups (see Martin, Y. et al., A fast new approach to pharmacophore mapping and its
application to dopaminergic and benzodiazepine agonists, J. Comput.-Aided Mol. Des. 1992, Vol
6, pp. 475-486); b) surface characterizations (see Perkins et al., Molecular surface-volume and
property matching to superpose flexible dissimilar molecules, J. Comput.-Aided Mol. Des. 1995,
Vol 9, pp. 479-490); or c) field-based properties that characterize regions of a molecule (see Willet,
A. et al., Similarity searching in files of three-dimensional chemical structures: Flexible
field-based searching of molecular electrostatic potentials, J. Chem. Inf. Comput. Sci. 1996, Vol
36, pp. 900-908).

Similarity searching for compounds in 3D databases is an important part of lead
generation, and is commonly practiced in the drug design process (see Klebe, G., Structural
Alignment of Molecules, in 3D QSAR in Drug Design: Theory, Methods, and Practice, and
Kearsley, S. K. et al., An alternative method for the alignment of molecular structures:
Maximizing electrostatic and steric overlap, Tetrahedron Comput. Methodol. 1990, Vol 3, pp.
615-633). It is useful in expanding the list of active compounds for a therapeutic target, finding
new uses for existing compounds, getting around a competitor's patent, or gaining more insight
into the nature of the therapeutic target under investigation. There are, however, several ways one
may conduct such searches, with no one method proven superior or universally applicable. Novel
procedures are thus a current research interest.

A common problem that arises in similarity searching is that of preparing an appropriate
distance metric. The problem arises when one must decide how to weight the relative importance
of descriptors when evaluating whether two features are similar. The problem is compounded by 
the fact that different contexts warrant different scalings of descriptors. Appropriate distance 
metrics in one context may not be suitable for another. 

The present invention embodies a novel procedure for 3D similarity searching that is
based on the alignment of heuristic property fields. The particular novelty offered by the present
invention is the independence of the particular property field used, and a context-dependent
scaling procedure that allows a training set to scale the descriptors.

SUMMARY OF THE INVENTION 

The problems stated above and the related problems of the prior art are addressed with 
the principles of the present invention, similarity searching of molecules based upon statistical 
analysis of descriptor vectors characterizing molecular regions. 

In a training phase, an association criterion is generated by which query regions of a 
query molecule are associated with regions of molecules stored in a database. Preferably, the 
association criterion is based upon statistical analysis of groups of descriptor vectors that 
characterize properties of the regions of the molecules stored in the database.

In an acquisition phase, for each molecule in a series of molecules, the following steps 
are performed for a given molecule. Data that represents the structure of the given molecule is 
read from persistent memory and used to define a set of three-dimensional regions of space in the 
given molecule. For each region, one or more properties of the given molecule are mapped to
property values for grid points of the region. A multi-map entry is generated that identifies the
region, and the position and orientation of a set of axes derived from the property values of the grid
points of the region. The association criterion generated in the training phase is used to generate a
key for the region, and the entry is stored in the multi-map at a location associated with the key. 

In the recognition phase, data that represents the structure of a query molecule is used to 
define a set of regions in the query molecule. For each region, one or more properties of the query
molecule are mapped to property values for grid points of the query region. The association
criterion generated in the training phase is used to generate a key for the query region. The
multi-map entry identified by the key is retrieved and the data stored therein are read from the
table. For each stored region identified by the retrieved table entry, an hypothesized match is
constructed and added to a vote table. After processing all of the stored regions identified by the
retrieved multi-map entry for the set of query regions in the query molecule, one or more entries
of the vote table are selected, the alignment transformations stored in the selected entries are
applied to corresponding molecules stored in the database, and the resultant alignment(s) of the
stored molecule in the query frame is reported to the user via an I/O device. 
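
By way of illustration only, the keyed storage and voting summarized above can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the key is assumed to be any hashable value (here a hypothetical tuple of quantized descriptor components), the names build_multi_map and tally_votes are hypothetical, and each hypothesized match is reduced to a single vote per stored molecule, omitting the alignment transformation that the vote table entries are described as carrying.

```python
from collections import Counter, defaultdict

def build_multi_map(stored_regions):
    """Acquisition-phase sketch: file each stored region under its key.

    stored_regions is an iterable of (key, molecule_id, region_id) tuples.
    Several regions may share one key, hence a multi-map (key -> list).
    """
    multi_map = defaultdict(list)
    for key, molecule_id, region_id in stored_regions:
        multi_map[key].append((molecule_id, region_id))
    return multi_map

def tally_votes(multi_map, query_keys):
    """Recognition-phase sketch: one vote per stored region sharing a query key."""
    votes = Counter()
    for key in query_keys:
        for molecule_id, _region_id in multi_map.get(key, []):
            votes[molecule_id] += 1   # simplified hypothesized match
    return votes

# Hypothetical keys: tuples of quantized descriptor components.
stored = [((1, 2), "mol_a", 0), ((1, 2), "mol_b", 3), ((4, 0), "mol_a", 1)]
multi_map = build_multi_map(stored)
votes = tally_votes(multi_map, [(1, 2), (4, 0), (9, 9)])
print(votes.most_common())
```

In this sketch a query key with no stored counterpart simply contributes no votes, and the highest-voted molecules would be the candidates whose alignments are reported.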

BRIEF DESCRIPTION OF THE DRAWINGS 

FIGS. 1(A) and 1(B) are block diagrams of computer processing systems wherein the 
methods of the present invention may be embodied. 

FIGS. 2(A) and (B) is a flow chart illustrating the method of the present invention in
mapping a descriptor vector for an item to a space that optimally discriminates between groups of 
items in accordance with the present invention; 

FIG. 3 is a flow chart illustrating operations of step 205 of FIG. 2 in generating a set of 
component vectors that maximize an F distributed criterion function in accordance with the 
present invention. 

FIG. 4 is a table illustrating a multi-factor design with multi-way analysis of variance. 

FIG. 5 is a flow chart illustrating the training phase of a system that identifies molecules 
within a database of molecules that has similar structure to a query molecule in accordance with 
the present invention. 

FIGS. 7(A) and (B) is a pictorial illustration of a multi-map, which is part of the system
that identifies molecules within a database of molecules that has similar structure to a query 
molecule. 

FIGS. 8(A) and (B) is a flow chart illustrating the recognition phase of a system that identifies
molecules within a database of molecules that has similar structure to a query molecule in 
accordance with the present invention. 

FIG. 9 is a flow chart illustrating an exemplary embodiment of the operations in 
constructing the match hypothesis in the recognition phase of the present invention. 

DETAILED DESCRIPTION OF THE INVENTION 

The present invention may be implemented on any computer processing system 
including, for example, a personal computer or a workstation. As shown in FIG. 1(A), a 
computer processing system 100 as may be utilized by the present invention generally comprises 
memory 101, at least one central processing unit (CPU) 103 (one shown), and at least one input 
device 107 (such as a keyboard, mouse, joystick, voice recognition system, or handwriting
recognition system). In addition, the computer processing system includes a nonvolatile storage 
device 108, such as a ROM or fixed disk drive, that stores an operating system and one or more 
application programs that are loaded into the memory 101 and executed by the CPU 103. In the 
execution of the operating system and application program(s), the CPU may use data stored in 
the nonvolatile storage device 108 and/or memory 101. In addition, the computer processing 
system includes a graphics adapter 104 coupled between the CPU 103 and a display device 105 
such as a CRT display or LCD display. In addition, the computer processing system may include 
a communication link 109 (such as a network adapter, RF link, or modem) coupled to the CPU 
103 that allows the CPU 103 to communicate with other computer processing systems over the 
communication link, for example over the Internet. The CPU 103 may receive portions of the 
operating system, portions of the application program(s), or portions of the data used by the CPU
103 in executing the operating system and application program(s). 

It should be noted that the application program(s) executed by the CPU 103 may perform 
the methods of the present invention described below. Alternatively, portions or all of the 
methods described below may be embodied in hardware that works in conjunction with the
application program executed by the CPU 103. 

According to the present invention, a computer implemented method provides a mapping 
(i.e., a transformation) for the components of the descriptor vectors for a series of items to a 
space that optimally discriminates between groups of items. With reference to FIGS. 2(A) and 
(B), the operation begins in step 201 wherein a series of items are classified into N groups (N is
an integer greater than 1), wherein each group is identified by an identifier i ranging from 1 to
N. For the sake of description, consider groups that contain $n_i$ items and descriptor vectors;
however, the present invention is not limited in this respect and can be used for groups that
contain a non-uniform number of items and descriptor vectors. In addition, $m_{ij}$ identifies item j
belonging to group i where j ranges from 1 to $n_i$, and $\vec{x}_{ij}$ is the descriptor vector
corresponding to the item $m_{ij}$. Preferably, the descriptor vectors are stored as part of a file in
persistent storage and loaded into non-persistent storage for use by the CPU 103 as needed.

In step 203, first data representing covariance between the groups of the items, denoted
$\varepsilon_b$, are generated. In addition, second data representing covariance within the items belonging
to the groups, denoted $\varepsilon_w$, are generated. Note that both the first data and second data follow a
chi-square distribution. An example of the operations in determining $\varepsilon_b$ and $\varepsilon_w$ is provided
below. In this example, the first data ($\varepsilon_b$) have a chi-square distribution with $N-1$ degrees of
freedom (where N represents the number of groups of items); and the second data ($\varepsilon_w$) have a
chi-square distribution with $\sum n_i - N$ degrees of freedom (where N represents the number of
groups of items, $n_i$ represents the number of items in group i, and $\sum n_i$ represents the sum of
$n_i$ for the N groups).

An example of the operations in determining $\varepsilon_b$ and $\varepsilon_w$ is now provided. One may
break any given descriptor vector $\vec{x}_{ij}$ into variation between groups and variations within a
group as follows:

$$\vec{x}_{ij} = \vec{a} + \vec{\alpha}_i + \vec{\alpha}_{ij}$$

where

$\vec{a}$ represents the mean of all the items (of all the groups);

$\vec{\alpha}_i$ represents deviation from the mean of all the items ($\vec{a}$) to the mean of group i; and

$\vec{\alpha}_{ij}$ represents deviation from the mean of group i ($\bar{x}_{i\cdot}$) to the descriptor vector $\vec{x}_{ij}$.



For each group i of items (i.e., i ranging from 1 to N), the sample mean, denoted $\bar{x}_{i\cdot}$, of the
descriptor vectors within group i is calculated. For example, the sample mean of the descriptor
vectors in a given group i may be calculated as follows:

$$\bar{x}_{i\cdot} = \frac{1}{n_i}\sum_{j} \vec{x}_{ij}$$

where the $\sum$ operator sums over the range of j = 1 to $n_i$.

The sample mean $\vec{a}$ of all the items is calculated. For example, the sample mean $\vec{a}$ may be
calculated as follows:

$$\vec{a} = \frac{1}{N}\sum_{i} \bar{x}_{i\cdot}$$

where the $\sum$ operator sums over the range of i = 1 to N.

For each group i of items (i.e., i ranging from 1 to N), the deviation $\vec{\alpha}_i$ is calculated. For
example, the deviation $\vec{\alpha}_i$ for a given group i may be calculated as follows:

$$\vec{\alpha}_i = \bar{x}_{i\cdot} - \vec{a}$$

For each descriptor vector $\vec{x}_{ij}$, the deviation $\vec{\alpha}_{ij}$ is calculated. For example, the deviation $\vec{\alpha}_{ij}$
may be calculated for a given descriptor vector $\vec{x}_{ij}$ as follows:

$$\vec{\alpha}_{ij} = \vec{x}_{ij} - \bar{x}_{i\cdot}$$

The covariance between groups, denoted $\varepsilon_b$, may be calculated as follows:

$$\varepsilon_b = \sum_{i} n_i\,\vec{\alpha}_i\,\vec{\alpha}_i^{T}$$

where the $\sum$ operator sums over the range of i = 1 to N.

And the covariance within groups, denoted $\varepsilon_w$, may be calculated as follows:

$$\varepsilon_w = \sum_{i}\sum_{j} \vec{\alpha}_{ij}\,\vec{\alpha}_{ij}^{T}$$

where the $\sum\sum$ operator sums over the range j = 1 to $n_i$ for each group i in the range 1 to N.

Note that the covariance between groups $\varepsilon_b$ has a chi-square distribution with $N-1$ degrees of
freedom (where N represents the number of groups of items); and the covariance within groups
$\varepsilon_w$ has a chi-square distribution with $\sum n_i - N$ degrees of freedom (where N represents the
number of groups of items, $n_i$ represents the number of items in group i, and $\sum n_i$ represents
the sum of $n_i$ for the N groups). The total covariance, denoted $\varepsilon$, is represented as follows:

$$\varepsilon = \sum_{i}\sum_{j} (\vec{x}_{ij} - \vec{a})(\vec{x}_{ij} - \vec{a})^{T} = \varepsilon_b + \varepsilon_w$$

where the $\sum\sum$ operator sums over the range j = 1 to $n_i$ for each group i in the range 1 to N.
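
By way of illustration only, the operations of step 203 may be sketched with NumPy as follows. This is a minimal sketch under stated assumptions rather than the disclosed implementation: the function name scatter_matrices is hypothetical, each group is assumed to be supplied as an array with one descriptor vector per row, and the overall mean is taken over all items, which coincides with the average of the group means when the groups are of equal size.

```python
import numpy as np

def scatter_matrices(groups):
    """Return (a, eps_b, eps_w) for a list of (n_i, d) arrays, one array per group."""
    all_items = np.vstack(groups)
    a = all_items.mean(axis=0)            # mean of all items of all groups
    d = all_items.shape[1]
    eps_b = np.zeros((d, d))
    eps_w = np.zeros((d, d))
    for x in groups:
        group_mean = x.mean(axis=0)       # sample mean of group i
        alpha_i = group_mean - a          # deviation of the group mean from the overall mean
        alpha_ij = x - group_mean         # deviations of the items from their group mean
        eps_b += len(x) * np.outer(alpha_i, alpha_i)
        eps_w += alpha_ij.T @ alpha_ij    # sum over j of the outer products
    return a, eps_b, eps_w

# Illustrative data: three groups of five 3-component descriptor vectors.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, size=(5, 3)) for m in (0.0, 1.0, 3.0)]
a, eps_b, eps_w = scatter_matrices(groups)

# The total covariance about the overall mean should split into the two parts.
centered = np.vstack(groups) - a
total = centered.T @ centered
print(np.allclose(total, eps_b + eps_w))
```

The final check compares the total covariance computed directly against the sum of the between-group and within-group parts, mirroring the decomposition stated above.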

In step 205, an F distributed criterion function is used to determine a set of component
vectors that maximize the criterion function. The criterion function has a numerator and a
denominator, whereby the numerator is based upon the first data ($\varepsilon_b$) generated in step 203 and
the denominator is based upon the second data ($\varepsilon_w$) generated in step 203. The criterion
function has the general form:

$$f(\vec{w}) = C\,\frac{\vec{w}^{T}\,\varepsilon_b\,\vec{w}}{\vec{w}^{T}\,\varepsilon_w\,\vec{w}}$$

where $\vec{w}$ is a vector in the direction whose variation is being tested, and C is a constant
based upon the degrees of freedom in $\varepsilon_b$ and $\varepsilon_w$.

For example, the constant C may be determined as follows:

$$C = \frac{1/(\text{degrees of freedom in } \varepsilon_b)}{1/(\text{degrees of freedom in } \varepsilon_w)} = \frac{1/(N-1)}{1/(\sum n_i - N)}$$

where N represents the number of groups of items, $n_i$ represents the number of items in a
group, and $\sum n_i$ represents the sum of $n_i$ for the N groups.

The set of component vectors is determined by solving for those $\vec{w}$ that maximize the criterion
function $f(\vec{w})$. An example of the operation of the computer processing system 100 in
determining the set of component vectors is set forth below with respect to FIG. 3. For the sake
of description, the number of component vectors belonging to this set is denoted D.
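
By way of illustration only, the criterion function of step 205 may be evaluated for a trial direction as sketched below. The function name criterion and its argument names are hypothetical, and the covariance matrices are assumed to have been generated beforehand as in step 203.

```python
import numpy as np

def criterion(w, eps_b, eps_w, group_sizes):
    """Evaluate C * (w^T eps_b w) / (w^T eps_w w) for a trial direction w."""
    n_groups = len(group_sizes)
    dof_b = n_groups - 1                   # degrees of freedom in eps_b
    dof_w = sum(group_sizes) - n_groups    # degrees of freedom in eps_w
    c = (1.0 / dof_b) / (1.0 / dof_w)
    w = np.asarray(w, dtype=float)
    return c * (w @ eps_b @ w) / (w @ eps_w @ w)

# Illustrative matrices for three groups of five items each (so C = 6).
eps_b = np.diag([4.0, 1.0])
eps_w = np.diag([2.0, 2.0])
sizes = [5, 5, 5]
print(criterion([1.0, 0.0], eps_b, eps_w, sizes))
print(criterion([0.0, 1.0], eps_b, eps_w, sizes))
```

Because the same vector appears in the numerator and the denominator, the value depends only on the direction of the trial vector and not on its length, which is why unit vectors suffice in the later steps.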

In step 207, a loop is performed over the set of component vectors generated in step 205
(i.e., k = 1 .. D) to calculate the value, denoted $f_k$, of the criterion function $f(\vec{w})$ at the given
component vector $\vec{w}_k$. If the operation set forth below with respect to FIG. 3 is used, the loop of
step 207 may be performed over the set of vectors $\vec{y}_k$, and the value $f_k$ for a given vector $\vec{y}_k$
may be calculated as follows:

$$f_k = C\,\hat{y}_k^{T}\,(\varepsilon_w)^{-1/2}\,\varepsilon_b\,(\varepsilon_w)^{-1/2}\,\hat{y}_k$$

where $\vec{y} = (\varepsilon_w)^{1/2}\,\vec{w}$, and

where $\hat{y}$ is the unit vector corresponding to the vector $\vec{y}$.

In addition, in step 207, the value of the criterion function $f(\vec{w})$ at the given component
vector $\vec{w}$ and the associated component vector $\vec{w}$ (or the value $f_k$ of the criterion function $f(\vec{y})$
at the given vector $\vec{y}$ and associated vector $\vec{y}$) are stored by the processing system 100.

In step 209, an F distributed statistic is generated for subsets of the set of component 
vectors generated in step 205. The F distributed statistic for a given subset of component vectors 
preferably represents a ratio of the variance between groups of items to the variance within 
groups of items along the given subset of component vectors. In this case, the statistic, denoted 
$\psi_S$, characterizing a given subset of component vectors preferably has the following form:

$$\psi_S = C\left\{\frac{1}{L_S}\right\}\sum f_k$$

where $f_k$ represents the value of the criterion function $f(\vec{w})$ at a given component vector in the subset,
C is a constant,
$L_S$ represents the number of $f_k$ values in the given subset of component vectors, and
the $\sum$ operation sums over the $L_S$ $f_k$ values in the given subset of component vectors.

Note that in the example above where $C = \frac{1/(N-1)}{1/(\sum n_i - N)}$, the $\psi_S$'s are F distributed with
$((N-1)L_S)$ degrees of freedom in the numerator and $(\sum n_i - N)$ degrees of freedom in
the denominator.

An optimal subset S of the component vectors is then selected by identifying a subset of
component vectors such that a probability value for the statistic $\psi_S$ associated with the subset
(preferably, the probability value represents the probability that the statistic $\psi_S$ for the
subset could have been larger by chance) satisfies a predetermined criterion (significance level).
This significance level represents a threshold at which an hypothesis that the aggregate
F-distributed ratios (for the subset) representing discrimination between groups of items is
smaller than that within groups of items can be rejected. A small significance level implies a
very significant rejection of this hypothesis, in turn implying high confidence in the hypothesis
that the aggregate F-distributed ratios (for the subset) representing discrimination between groups
of items is greater than or equal to that within groups of items. In this case, the optimal subset
S of the component vectors is selected by identifying the subset of component vectors whose
probability value for the statistic $\psi_S$ associated with the subset is a minimum. This minimum
probability value implies a maximum confidence in the hypothesis that the aggregate
F-distributed ratios (for the subset) representing discrimination between groups of items is
greater than or equal to that within groups of items. Selection of the optimal subset S is
preferably accomplished as follows.

In step 211, the D $f_k$ values generated in step 207 are ranked in descending order (i.e.,
from largest to smallest). 

In step 213, a loop is performed over subsets of component vectors whereby, for each
subset, the statistic $\psi_S$ associated with the subset is generated and a probability value for statistic
$\psi_S$ is calculated. Preferably, the loop is performed over a counter X ranging from 1 to D. In
each iteration of the loop, the following operations are performed. First, the largest X $f_k$ values
identified in step 211 are added together as follows

$$F_X = f_{k_1} + f_{k_2} + \cdots + f_{k_X}$$

and the resultant sum is normalized by a division by X (i.e., $F_X = F_X / X$). Second, a
probability value, denoted $p_X$, for the normalized sum $F_X$ is calculated. Third, the
probability value $p_X$ and associated counter X are stored.

The probability value $p_X$ for a normalized sum $F_X$ may be calculated in step 213 as
follows:

$$p_X = Q(F_X \mid N-1, X)$$

A more detailed description of the calculation of the probability value px for a subset of 
component vectors may be found in Press et al., "Numerical Recipes", Cambridge University 
Press, 1986, pp. 175-190, herein incorporated by reference in its entirety. And a more detailed
description of the F distribution function is set forth in Freund, "Mathematical Statistics", 5th 
Ed., Prentice Hall, 1992, pp. 314-315, herein incorporated by reference in its entirety. 



In step 215, one or more probability values $p_X$ generated in step 213 that satisfy a
predetermined criterion are selected, and the subset of component vectors
corresponding to the selected probability value(s) is identified as the optimal subset S of component
vectors. Preferably, in step 215, the minimum probability value $p_X$ generated in step 213 is
selected, and the subset of component vectors corresponding to the selected
probability value is identified as the optimal subset S of component vectors.
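
By way of illustration only, steps 211 through 215 may be sketched in plain Python as follows. This is a sketch under stated assumptions, not the disclosed implementation. The upper-tail probability of the F distribution is computed from the regularized incomplete beta function by the standard continued-fraction method. The degrees of freedom are assumed to be (N-1)X in the numerator and the total number of items minus N in the denominator, following the note on the statistic given above. The input values are assumed to be already scaled as F ratios, and all function names are hypothetical.

```python
import math

def _beta_cf(a, b, x, max_iter=200, eps=3e-12):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def incomplete_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b

def f_upper_tail(f, dof1, dof2):
    """Probability that an F(dof1, dof2) variate exceeds f."""
    return incomplete_beta(0.5 * dof2, 0.5 * dof1, dof2 / (dof2 + dof1 * f))

def select_optimal_subset(f_values, n_groups, n_items):
    """Return (p_min, X): smallest probability and the subset size attaining it."""
    ranked = sorted(f_values, reverse=True)                 # step 211
    results = []
    for x in range(1, len(ranked) + 1):                     # step 213
        f_x = sum(ranked[:x]) / x                           # normalized sum F_X
        p_x = f_upper_tail(f_x, (n_groups - 1) * x, n_items - n_groups)
        results.append((p_x, x))
    return min(results)                                     # step 215

p_min, best_x = select_optimal_subset([0.5, 12.0, 0.8], n_groups=3, n_items=15)
print(best_x, p_min)
```

The returned counter X identifies the optimal subset as the X component vectors having the largest criterion values.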

Finally, in step 217, one or more of the descriptor vectors for the series of items are 
mapped to a space corresponding to the optimal subset S of component vectors. This mapping 

is preferably accomplished for a given descriptor vector $\vec{x}_{ij}$ by performing a loop over each
component vector $\vec{w}$ belonging to the optimal subset S of component vectors whereby, in each
iteration of the loop, the contribution $\hat{w}^{T}\vec{x}_{ij}$ (where $\hat{w}^{T}$ is the transpose of the unit vector $\hat{w}$ for
the given component vector $\vec{w}$) is added to a running sum. Consider an example wherein the



optimal subset S includes three component vectors ($\vec{w}_1$, $\vec{w}_2$, $\vec{w}_3$). In this example, the
mapping, denoted M, may be represented by the following:

$$M(\vec{x}_{ij}) = \left(\hat{w}_1^{T}\vec{x}_{ij},\ \hat{w}_2^{T}\vec{x}_{ij},\ \hat{w}_3^{T}\vec{x}_{ij}\right)$$

FIG. 3 illustrates an example of the operation of the computer processing system 100 in
determining the set of component vectors (i.e., those $\vec{w}$) that maximize the criterion function
$f(\vec{w})$. The approach defines a vector $\vec{y}$ that is a function of $\vec{w}$ along which the contribution from
the denominator of the criterion function $f(\vec{w})$ is independent.

In step 301, the set of eigenvalue/eigenvector pairs of the matrix $\varepsilon_w$ are calculated. This
may be accomplished, for example, with the techniques set forth in Press et al., "Numerical
Recipes", Cambridge University Press, 1986, pp. 349-363, herein incorporated by reference in its
entirety. The set of eigenvalue/eigenvector pairs for which the eigenvalue is non-zero, denoted
set $E_w$, is then stored by the computer processing system 100. For the sake of description, the
number of eigenvalue/eigenvector pairs belonging to set $E_w$ is denoted K, and a given
eigenvalue/eigenvector pair belonging to set $E_w$ is denoted ($e_k$, $\vec{v}_k$).

The operations may then define a vector, denoted $\vec{y}$, as a function of $\vec{w}$ as follows:

$$\vec{y} = (\varepsilon_w)^{1/2}\,\vec{w}, \quad \text{where } (\varepsilon_w)^{1/2} = \sum_{k} e_k^{1/2}\,\hat{v}_k\,\hat{v}_k^{T}$$

where $\hat{v}_k$ is the unit vector corresponding to eigenvector $\vec{v}_k$ and
the $\sum$ operation sums over the K eigenvalue/eigenvector pairs
belonging to set $E_w$.

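The criterion function $f(\vec{w})$ may be rewritten as $f(\vec{y})$ as follows:

$$f(\vec{y}) = C\,\hat{y}^{T}\,(\varepsilon_w)^{-1/2}\,\varepsilon_b\,(\varepsilon_w)^{-1/2}\,\hat{y}$$

where $\hat{y}$ is the unit vector corresponding to the vector $\vec{y}$.

The set of vectors $\vec{y}_k$ that maximize the criterion function $f(\vec{y})$ may be generated by
solving the following eigenvalue equation:

$$(\varepsilon_w)^{-1/2}\,\varepsilon_b\,(\varepsilon_w)^{-1/2}\,\vec{y}_k = f_k\,\vec{y}_k$$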

Finally, the set of vectors $\vec{y}$ that maximize the criterion function $f(\vec{y})$ may then be
transformed to generate the corresponding set of component vectors $\vec{w}$ that maximize the
criterion function $f(\vec{w})$ as follows:

$$\vec{w} = (\varepsilon_w)^{-1/2}\,\vec{y}$$

In step 303, the computer processing system 100 calculates the matrix
$(\varepsilon_w)^{-1/2}\,\varepsilon_b\,(\varepsilon_w)^{-1/2}$ utilizing the eigenvalue/eigenvector pairs belonging to set $E_w$ calculated in
step 301. Note that $(\varepsilon_w)^{-1/2}$ may be calculated as follows:

$$(\varepsilon_w)^{-1/2} = \sum_{k} e_k^{-1/2}\,\hat{v}_k\,\hat{v}_k^{T}$$

where $\hat{v}_k$ is the unit vector corresponding to eigenvector $\vec{v}_k$ and
the $\sum$ operation sums over the K eigenvalue/eigenvector pairs
belonging to set $E_w$.

In addition, in step 303, the computer processing system 100 calculates the set of
eigenvalue/eigenvector pairs for the matrix $(\varepsilon_w)^{-1/2}\,\varepsilon_b\,(\varepsilon_w)^{-1/2}$. This may be accomplished, for
example, with the techniques set forth in Press et al., "Numerical Recipes", Cambridge
University Press, 1986, pp. 349-363, incorporated by reference above in its entirety. The set of
eigenvalue/eigenvector pairs for which the eigenvalue is non-zero, denoted set $E_{wb}$, is then stored
by the computer processing system 100. For the sake of description, the number of
eigenvalue/eigenvector pairs belonging to set $E_{wb}$ is denoted K', and a given
eigenvalue/eigenvector pair belonging to set $E_{wb}$ is denoted ($e_k^{wb}$, $\vec{v}_k^{wb}$). The K' eigenvectors
(each denoted $\vec{v}_k^{wb}$) of the eigenvalue/eigenvector pairs belonging to set $E_{wb}$ represent the set of
vectors $\vec{y}$ that maximize the criterion function $f(\vec{y})$.

In step 305, the computer processing system 100 transforms the eigenvectors (each
denoted $\vec{v}_k^{wb}$) of the eigenvalue/eigenvector pairs belonging to set $E_{wb}$ to a set of corresponding
component vectors in $\vec{w}$ space (each denoted $\vec{w}_k$) as follows:

$$\vec{w}_k = (\varepsilon_w)^{-1/2}\,\vec{v}_k^{wb}$$

The set of K' vectors (each denoted $\vec{w}_k$) represent the set of component vectors that maximize
the criterion function $f(\vec{w})$.
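
By way of illustration only, steps 301 through 305 may be sketched with NumPy as follows. This is a sketch under stated assumptions rather than the disclosed implementation: the function name component_vectors and the tolerance used to decide that an eigenvalue is non-zero are hypothetical, and the symmetric eigensolver numpy.linalg.eigh is used in place of the routines of the cited reference. The returned eigenvalues are the ratios without the constant C.

```python
import numpy as np

def component_vectors(eps_b, eps_w, tol=1e-10):
    """Return (ratios, w) where the columns of w are the component vectors."""
    # Step 301: eigenpairs of eps_w, keeping only non-zero eigenvalues.
    e, v = np.linalg.eigh(eps_w)
    keep = e > tol * e.max()
    e, v = e[keep], v[:, keep]
    # (eps_w)^(-1/2) as the sum over k of e_k^(-1/2) v_k v_k^T.
    inv_sqrt = (v / np.sqrt(e)) @ v.T
    # Step 303: eigenpairs of (eps_w)^(-1/2) eps_b (eps_w)^(-1/2).
    m = inv_sqrt @ eps_b @ inv_sqrt
    ratios, y = np.linalg.eigh(m)
    order = np.argsort(ratios)[::-1]      # largest ratio first
    ratios, y = ratios[order], y[:, order]
    nonzero = ratios > tol
    ratios, y = ratios[nonzero], y[:, nonzero]
    # Step 305: transform each y back to w space.
    w = inv_sqrt @ y
    return ratios, w

eps_b = np.diag([4.0, 1.0])
eps_w = np.diag([2.0, 2.0])
ratios, w = component_vectors(eps_b, eps_w)
for k in range(len(ratios)):
    wk = w[:, k]
    # Each ratio should equal (w^T eps_b w) / (w^T eps_w w) for its own vector.
    print(ratios[k], (wk @ eps_b @ wk) / (wk @ eps_w @ wk))
```

Multiplying each returned ratio by the constant C of step 205 would give the corresponding value of the criterion function.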

As described above, the computer implemented method of the present invention provides 
a mapping (i.e., a transformation) for the descriptor vectors for a series of items to a space that 
optimally discriminates between groups of items. The method may be used in many domains. 

For example, in the domain of bio-informatics, the items of interest may be genotypes 
that are partitioned into groups based upon phenotypes exhibited by such genotypes; and the 
descriptor vectors associated with such genotypes may represent biological, chemical, and/or 
physical properties of such genotypes. Typically, a candidate genotype and associated descriptor 
vector is identified and one wishes to ascertain to which group the candidate genotype belongs. 
The method discussed above may be used to provide a suggestion as to which group the 
candidate genotype belongs. 

More specifically, the method described above is used to map the descriptor vector for 
each genotype to a space that optimally discriminates between groups. In addition, the statistical
mean of the mapped descriptor vectors for each group [or some other statistical variable (such as 
the covariance about the mean) based upon the mapped descriptor vectors] is calculated. In 
addition, the mapping function that optimally discriminates between groups is applied to the 
descriptor vector for the candidate genotype. A suggestion as to which group the candidate 
genotype belongs is then determined based upon differences between the mapped descriptor 
vector for the candidate genotype and the statistical mean of the mapped descriptor vectors for
the groups. 
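
By way of illustration only, the suggestion of a group for a candidate may be sketched as follows. This is a sketch under stated assumptions, not the disclosed implementation: the mapping is taken to be the projection of a descriptor vector onto the unit vectors of the optimal subset, the "difference" between mapped vectors is assumed to be the Euclidean distance, and the names map_descriptor and suggest_group are hypothetical.

```python
import numpy as np

def map_descriptor(x, w_subset):
    """Project descriptor vector x onto the unit component vectors (columns of w_subset)."""
    w_hat = w_subset / np.linalg.norm(w_subset, axis=0)
    return w_hat.T @ x

def suggest_group(candidate, groups, w_subset):
    """Return (label, distances) for the group whose mapped mean is nearest the candidate."""
    mapped_candidate = map_descriptor(candidate, w_subset)
    distances = {}
    for label, items in groups.items():
        mapped = np.array([map_descriptor(x, w_subset) for x in items])
        mapped_mean = mapped.mean(axis=0)          # statistical mean in the mapped space
        distances[label] = float(np.linalg.norm(mapped_candidate - mapped_mean))
    return min(distances, key=distances.get), distances

# Illustrative data: only the first descriptor component separates the groups.
groups = {
    "A": np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 9.0]]),
    "B": np.array([[5.0, 0.0, 2.0], [6.0, 0.0, 7.0]]),
}
w_subset = np.array([[2.0], [0.0], [0.0]])         # one component vector, not yet unit length
label, distances = suggest_group(np.array([4.5, 0.0, 100.0]), groups, w_subset)
print(label, distances)
```

In this sketch the third descriptor component, which varies widely but does not separate the groups, has no influence on the suggestion because it is not part of the mapped space.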

In order to illustrate this example in more detail, consider the case where the genotypes
are a population of male patients partitioned into two groups (A, B) based upon prostate cancer
activity experienced by the patients (i.e., patients belonging to group A have not experienced any
prostate cancer activity, and patients belonging to group B have experienced prostate cancer
activity). In addition, the descriptor vector associated with each male patient is expression data
acquired from a gene probe array system, such as the GeneChip systems developed by
Affymetrix, Inc. (information is available at http://www.affymetrix.com/products/). In addition,
a candidate and associated descriptor vector has been identified. Again, the descriptor vector
associated with the candidate is expression data acquired from a gene probe array system, and
one wishes to ascertain to which group the candidate belongs. The method discussed above may
be used to provide a suggestion as to which group the candidate belongs. More specifically, the
method described above is used to map the descriptor vector for each patient to a space that
optimally discriminates between groups. In addition, the statistical mean of the mapped descriptor
vectors for each group (or some other statistical variable, such as the covariance about the mean,
based upon the mapped descriptor vectors) is calculated. In addition, the same mapping function
is applied to the descriptor vector for the candidate. Finally, a suggestion as to which group the
candidate belongs is determined based upon differences between the mapped descriptor vector
for the candidate and the statistical mean of the mapped descriptor vectors for the groups A, B.

In another example in the domain of bio-informatics, the items of interest may be gene 
sequences that are partitioned into groups; and the descriptor vectors associated with such gene 
sequences may represent expression activities of such gene sequences. The method discussed 
above may be used to derive a probability value for the statistic $\psi_S$ that represents a significance
level at which an hypothesis that the aggregate F-distributed ratios representing discrimination
between groups of expression activities of gene sequences is smaller than that within groups of
expression activities of gene sequences can be rejected. A small significance level implies a very
significant rejection of this hypothesis, in turn implying high confidence in the hypothesis that
the aggregate F-distributed ratios representing discrimination between groups of expression
activities of gene sequences is greater than or equal to that within groups of expression activities 
of gene sequences. In the event that this probability value is less than a threshold value, there is a 
suggestion that the gene sequences express differences in activities. 

In the domain of computational biology and chemistry, for example, the items of interest 
may be molecular complexes (a molecule or portion of a molecule) that are partitioned into 
groups based upon a biological activity (for example, molecular complexes belonging to a group 
all bind to another molecular complex); and the descriptor vectors associated with such 
molecular complexes may represent the physical property (structure, charge distribution), 
biological property and/or chemical property of such molecular complexes. For instance, the 
descriptor vectors associated with a molecule may be based upon data representing the structure 
and/or charge distribution of the molecule as outlined in i) U.S. Patent No. 5,784,294 to Platt et al.,
which is commonly assigned to the assignee of the present invention, ii) R.D. Cramer et al.,
"Comparative Molecular Field Analysis (CoMFA). Effect of Shape on Binding of Steroids to
Carrier Proteins," J. Am. Chem. Soc., Vol. 110, 1988, pp. 5959-5967, iii) A.C. Good et al.,
"Structure-Activity Relationships from Molecular Similarity Matrices," J. Med. Chem., Vol. 36,
1993, pp. 433-438, iv) A. Jain et al., "Predicting Biological Activities from Molecular Surface
Properties. Performance Comparisons on a Steroid Benchmark," J. Med. Chem., Vol. 37, 1994,
pp. 2315-2327, v) W. Fisanick et al., "Similarity Searching on CAS Registry Substances, 1:
Global Molecular Property and Generic Atom Triangle Geometric Searching," Journal of 
Chemical Information and Computer Sciences, Vol. 32, No. 6, 1992, pp. 664-674, and W. 
Fisanick et al., "Similarity Searching on CAS Registry Substances, 2: 2D Structural Similarity," 
Journal of Chemical Information and Computer Sciences, Vol. 34, No. 1, 1994, pp. 130-140, 
hereinafter incorporated by reference in their entirety. Similarly, descriptor vectors may be 
associated with a portion of a molecule (i.e., a subset of the atoms that make up a molecule) and 
represent the structure and/or charge distribution of the portion of the molecule. In such 
domains, typically, a candidate molecular complex and associated descriptor vector is identified 
and one wishes to ascertain to which group the candidate molecular complex belongs. The 
computer implemented method of the present invention described above may be used to provide 
a suggestion as to which group the candidate molecular complex belongs. More specifically, the 
method described above is used to map the descriptor vector for each molecular complex to a 
space that optimally discriminates between groups. In addition, the statistical mean of the mapped 
descriptor vectors for each group [or some other statistical variable (such as the covariance about 
the mean) based upon the mapped descriptor vectors] is calculated. In addition, the mapping 
function that optimally discriminates between groups is applied to the descriptor vector for the 
candidate molecular complex. A suggestion as to which group the candidate molecular complex 
belongs is then determined based upon differences between the mapped descriptor vector for the 
candidate molecular complex and the statistical mean of the mapped descriptor vectors for the 
groups. 
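
The group suggestion described above can be sketched as follows. This is a minimal illustration, not the method of the invention itself: it assumes the descriptor vectors have already been mapped, and it assumes that the "difference" between a mapped candidate and a group mean is measured by Euclidean distance. The function names and the sample data are hypothetical.

```python
import math

def group_means(mapped_by_group):
    """Mean of the mapped descriptor vectors for each group."""
    means = {}
    for group, vectors in mapped_by_group.items():
        dim = len(vectors[0])
        means[group] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return means

def suggest_group(mapped_candidate, means):
    """Suggest the group whose mean is nearest to the mapped candidate.

    Assumption: Euclidean distance stands in for the "difference" between
    the mapped candidate vector and each group mean.
    """
    return min(means, key=lambda g: math.dist(mapped_candidate, means[g]))

# Hypothetical mapped descriptor vectors for two groups of molecular complexes.
mapped = {
    "binder": [[1.0, 0.9], [1.2, 1.1]],
    "non_binder": [[-1.0, -0.8], [-0.9, -1.2]],
}
means = group_means(mapped)
suggestion = suggest_group([0.8, 1.0], means)
```

A covariance-weighted distance could be substituted for `math.dist` where the covariance about each group mean is retained, as the text contemplates.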

The description above illustrates application of the present invention in one-way analysis 
of variance. However, the present invention is not limited in this respect, and can be applied to 
multi-factor designs with multi-way analysis of variance. An example of such a design for an 
item of interest is illustrated in FIG. 4. In this design, a series of items are classified into $N$ 
groups ($N$ is an integer greater than 1), wherein each group is identified by an identifier $i$ 
ranging from 1 to $N$, represented by the columns of the matrix. A series of factors (typically each 
factor represents one or more experimental treatments) are attributed to the items belonging to 
the groups, wherein each factor is identified by an identifier $j$ ranging from 1 to $M$, represented 
by the rows of the matrix. An identifier $m_{ijk}$ identifies item $k$ that is attributed to the 
group/factor pair $i,j$, and vector $\vec{X}_{ijk}$ identifies a descriptor vector attributed to the item 
$m_{ijk}$. Preferably, the descriptor vectors are stored as part of a file in persistent storage and 
loaded into non-persistent storage for use by the CPU 103 as needed. 

In such a system, one may break any given descriptor vector $\vec{X}_{ijk}$ into the following 
components: 

$$\vec{X}_{ijk} = \vec{\alpha} + \vec{\beta}_i + \vec{\gamma}_j + \vec{\delta}_{ij} + \vec{\delta}_{ijk}$$ 

where 

$\vec{\alpha}$ represents the mean of the descriptor vectors for all the items (of all the groups and all 
the factors); 

$\vec{\beta}_i$ represents the mean of the descriptor vectors for all the items in group $i$ (of all the 
factors), expressed as a deviation from $\vec{\alpha}$; 

$\vec{\gamma}_j$ represents the mean of the descriptor vectors for all the items in factor $j$ (of all the 
groups), expressed as a deviation from $\vec{\alpha}$; 

$\vec{\delta}_{ij}$ represents the mean of the deviation of the descriptor vectors for all the items in 
group/factor pair $i,j$ from the sum ($\vec{\alpha} + \vec{\beta}_i + \vec{\gamma}_j$); and 

$\vec{\delta}_{ijk}$ represents the deviation of the descriptor vector $\vec{X}_{ijk}$ from the mean of the 
descriptor vectors in group/factor pair $i,j$. 



These values may be calculated as follows. For each group/factor pair $i,j$ of items (i.e., 
$i$ ranging from 1 to $N$, and $j$ ranging from 1 to $M$), the mean of the descriptor vectors within 
group/factor pair $i,j$, denoted $\bar{x}_{ij}$, may be calculated as follows: 

$$\bar{x}_{ij} = \frac{1}{n_{ij}} \sum_{k} \vec{X}_{ijk}$$ 

where $n_{ij}$ represents the number of items in the group/factor pair $i,j$, and the $\Sigma$ operator 
sums over the range ($k = 1 \ldots n_{ij}$) of items in the group/factor pair $i,j$. 

Then, for each group $i$ of items (i.e., $i$ ranging from 1 to $N$), the mean $\bar{x}_{i\cdot}$ may be calculated as 
follows: 

$$\bar{x}_{i\cdot} = \frac{1}{n_i} \sum_{j} n_{ij}\, \bar{x}_{ij}$$ 

where $n_i$ represents the number of items in the group $i$, and the $\Sigma$ operator sums over 
the range ($1 \ldots M$) of factors. 

Then, for each factor $j$ (i.e., $j$ ranging from 1 to $M$), the mean $\bar{x}_{\cdot j}$ may be calculated as follows: 






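$$\bar{x}_{\cdot j} = \frac{1}{n_j} \sum_{i} n_{ij}\, \bar{x}_{ij}$$ 

where $n_j$ represents the number of items in the factor $j$, and the $\Sigma$ operator sums over 
the range ($1 \ldots N$) of groups. 

Then, the mean $\vec{\alpha}$ may be calculated as follows: 

$$\vec{\alpha} = \frac{1}{N} \sum_{i} \bar{x}_{i\cdot}$$ 

where $N$ represents the number of groups, and the $\Sigma$ operator sums over the range ($1 \ldots N$) of 
groups; or equivalently 

$$\vec{\alpha} = \frac{1}{M} \sum_{j} \bar{x}_{\cdot j}$$ 

where $M$ represents the number of factors, and the $\Sigma$ operator sums over the range ($1 \ldots M$) of 
factors. 

For each group $i$ of items (i.e., $i$ ranging from 1 to $N$), the mean $\vec{\beta}_i$ may be calculated as 
follows: 

$$\vec{\beta}_i = \bar{x}_{i\cdot} - \vec{\alpha}$$ 

For each factor $j$ of items (i.e., $j$ ranging from 1 to $M$), the mean $\vec{\gamma}_j$ may be calculated as 
follows: 

$$\vec{\gamma}_j = \bar{x}_{\cdot j} - \vec{\alpha}$$ 

For each group/factor pair $i,j$ of items (i.e., $i$ ranging from 1 to $N$, and $j$ ranging from 1 to $M$), 
the mean $\vec{\delta}_{ij}$ may be calculated as follows: 

$$\vec{\delta}_{ij} = \bar{x}_{ij} - \left(\vec{\alpha} + \vec{\beta}_i + \vec{\gamma}_j\right)$$ 

Finally, for each descriptor vector $\vec{X}_{ijk}$ in the group/factor pair $i,j$, the deviation $\vec{\delta}_{ijk}$ may be 
calculated as follows: 

$$\vec{\delta}_{ijk} = \vec{X}_{ijk} - \bar{x}_{ij}$$ 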

The total covariance, denoted e , may be represented by the sum of the following four (4) 

terms: 

(1) $\sum_{i} M_i\, \vec{\beta}_i \vec{\beta}_i^{T}$, where the $\Sigma$ operator sums over the range of groups 
$i = 1$ to $N$ and $M_i$ is the number of items in group $i$. 

(2) $\sum_{j} N_j\, \vec{\gamma}_j \vec{\gamma}_j^{T}$, where the $\Sigma$ operator sums over the range of factors 
$j = 1$ to $M$ and $N_j$ is the number of items in factor $j$. 

(3) $\sum_{i,j} N_{ij}\, \vec{\delta}_{ij} \vec{\delta}_{ij}^{T}$, where the $\Sigma$ operator sums over the range of 
group/factor pairs $i,j$ and $N_{ij}$ is the number of items 
belonging to the group/factor pair $i,j$. 

(4) $\sum_{i,j,k} \vec{\delta}_{ijk} \vec{\delta}_{ijk}^{T}$, where the $\Sigma$ operator sums over the range of 
group/factor pairs $i,j$ and the $k$ items in each group/factor pair. 

Note the first term represents the covariance between groups of items; the second term represents 
the covariance between factors; and the third term represents the covariance of the interaction 
between the groups and factors. 
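
The decomposition and the four covariance terms can be checked numerically. The sketch below is illustrative only: it assumes a balanced design (the same number of items in every group/factor pair), it assumes each term is a sum of outer products of the corresponding component vectors, and the function name `two_way_terms` and the random test data are hypothetical.

```python
import numpy as np

def two_way_terms(X):
    """X[i][j] is an (n_ij, d) array of descriptor vectors for group i, factor j."""
    N, M = len(X), len(X[0])
    n = np.array([[len(X[i][j]) for j in range(M)] for i in range(N)], dtype=float)
    # Mean descriptor vector of every group/factor pair, shape (N, M, d).
    cell = np.array([[np.mean(X[i][j], axis=0) for j in range(M)] for i in range(N)])
    # Group and factor means, weighted by the number of items in each pair.
    group = (n[:, :, None] * cell).sum(axis=1) / n.sum(axis=1)[:, None]
    factor = (n[:, :, None] * cell).sum(axis=0) / n.sum(axis=0)[:, None]
    alpha = group.mean(axis=0)
    beta = group - alpha
    gamma = factor - alpha
    delta = cell - (alpha + beta[:, None, :] + gamma[None, :, :])
    # The four terms: between groups, between factors, interaction, residual.
    t1 = sum(n[i].sum() * np.outer(beta[i], beta[i]) for i in range(N))
    t2 = sum(n[:, j].sum() * np.outer(gamma[j], gamma[j]) for j in range(M))
    t3 = sum(n[i, j] * np.outer(delta[i, j], delta[i, j])
             for i in range(N) for j in range(M))
    t4 = sum(np.outer(x - cell[i, j], x - cell[i, j])
             for i in range(N) for j in range(M) for x in X[i][j])
    return alpha, beta, gamma, delta, (t1, t2, t3, t4)

# Hypothetical balanced design: 2 groups, 3 factors, 4 items per pair, 3 components.
rng = np.random.default_rng(0)
X = [[rng.normal(size=(4, 3)) + i + 2 * j for j in range(3)] for i in range(2)]
alpha, beta, gamma, delta, (t1, t2, t3, t4) = two_way_terms(X)

# Total scatter of all descriptor vectors about the overall mean.
stacked = np.concatenate([X[i][j] for i in range(2) for j in range(3)])
total = (stacked - alpha).T @ (stacked - alpha)
```

In a balanced design the cross terms vanish, so `total` equals `t1 + t2 + t3 + t4`; for unbalanced designs that identity need not hold exactly.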

Generally, the principles of the present invention as described above may then be used to 
map the components of the descriptor vectors of the items to a space that: 

1) optimally discriminates between the groups of items; 

2) optimally discriminates between factors; or 

3) optimally discriminates interactions between the groups and factors. 

More specifically, in order to map the components of the descriptor vectors of the items 
to a space that optimally discriminates between groups of items (case 1 above), an F distributed 
criterion function is used to determine a set of component vectors that maximize the criterion 
function. The criterion function has a numerator and a denominator, whereby the numerator is 
based upon the covariance between groups of items (the first term described above) and the 
denominator is based upon the fourth term described above. The criterion function has the 
general form: 

$$f(\vec{w}) = \frac{\vec{w}^{T}\, T_1\, \vec{w}}{\vec{w}^{T}\, T_4\, \vec{w}}$$ 

where $\vec{w}$ is some vector, and $T_1$ and $T_4$ denote the first and fourth terms described above. 



The set of component vectors is determined by solving for those w that maximize the criterion 
function $f(\vec{w})$, and an optimal subset of the component vectors is identified. This operation 
is described in more detail above with respect to steps 205-215 of FIG. 2. Finally, 
the components of the descriptor vectors for the items are mapped to a space corresponding to 
the optimal subset of the component vectors. The resultant data optimally discriminates between 
groups of items. A more detailed description of this operation is described above with respect to 
step 217 of FIG. 2. 
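
A ratio of quadratic forms of this kind can be maximized numerically in several ways. The sketch below uses one common approach, a generalized symmetric eigenproblem solved by Cholesky whitening, and is not taken from steps 205-215 of FIG. 2. The matrices `B` (numerator term) and `W` (denominator term) are hypothetical, and `W` is assumed to be symmetric positive definite.

```python
import numpy as np

def discriminant_directions(B, W):
    """Unit vectors w that maximize (w^T B w) / (w^T W w), best first."""
    # Whiten with the Cholesky factor of W (W = L L^T) so that the problem
    # becomes an ordinary symmetric eigenproblem for A = L^-1 B L^-T.
    L = np.linalg.cholesky(W)
    L_inv = np.linalg.inv(L)
    A = L_inv @ B @ L_inv.T
    vals, vecs = np.linalg.eigh(A)            # eigenvalues in ascending order
    order = np.argsort(vals)[::-1]
    w = L_inv.T @ vecs[:, order]              # back to the original coordinates
    w = w / np.linalg.norm(w, axis=0)         # normalize each column to unit length
    return vals[order], w

def criterion(w, B, W):
    """Value of the ratio of quadratic forms for a single vector w."""
    return float(w @ B @ w) / float(w @ W @ w)

# Hypothetical numerator and denominator matrices.
B = np.array([[4.0, 1.0], [1.0, 1.0]])
W = np.array([[2.0, 0.0], [0.0, 1.0]])
ratios, directions = discriminant_directions(B, W)

# Map a descriptor vector onto the best direction (transpose of the unit vector).
x = np.array([1.0, 2.0])
mapped = directions[:, :1].T @ x
```

Each returned ratio is the value of the criterion along the corresponding direction, so an "optimal subset" can be taken as the leading columns of `directions`; how many to keep is a separate choice that this sketch does not address.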

In order to map the components of the descriptor vectors of the items to a space that 
optimally discriminates between factors (case 2 above), an F distributed criterion function is used 
to determine a set of component vectors that maximize the criterion function. The criterion 
function has a numerator and a denominator, whereby the numerator is based upon the 
covariance between factors (the second term described above) and the denominator is based upon 
the fourth term described above. The criterion function has the general form: 

$$f(\vec{w}) = \frac{\vec{w}^{T}\, T_2\, \vec{w}}{\vec{w}^{T}\, T_4\, \vec{w}}$$ 

where $\vec{w}$ is some vector, and $T_2$ and $T_4$ denote the second and fourth terms described above. 

The set of component vectors is determined by solving for those $\vec{w}$ that maximize the criterion 
function $f(\vec{w})$, and an optimal subset of the component vectors is identified. This operation 
is described in more detail above with respect to steps 205-215 of FIG. 2. Finally, 
the components of the descriptor vectors for the items are mapped to a space corresponding to 
the optimal subset of the component vectors. The resultant data optimally discriminates between 
factors. A more detailed description of this operation is described above with respect to step 217 
of FIG. 2. 

In order to map the components of the descriptor vectors of the items to a space that 
optimally discriminates interactions between the groups and factors (case 3 above), an F 
distributed criterion function is used to determine a set of component vectors that maximize the 
criterion function. The criterion function has a numerator and a denominator, whereby the 
numerator is based upon the covariance of the interaction between the groups and factors (the 
third term described above) and the denominator is based upon the fourth term described above. 
The criterion function has the general form: 

$$f(\vec{w}) = \frac{\vec{w}^{T}\, T_3\, \vec{w}}{\vec{w}^{T}\, T_4\, \vec{w}}$$ 

where $\vec{w}$ is some vector, and $T_3$ and $T_4$ denote the third and fourth terms described above. 



The set of component vectors is determined by solving for those $\vec{w}$ that maximize the criterion 
function $f(\vec{w})$, and an optimal subset of the component vectors is identified. This operation 
is described in more detail above with respect to steps 205-215 of FIG. 2. Finally, 
the components of the descriptor vectors for the items are mapped to a space corresponding to 
the optimal subset of the component vectors. The resultant data optimally discriminates 
interaction between the groups and factors. A more detailed description of this operation is 
described above with respect to step 217 of FIG. 2. 

The application of the present invention to multi-factor designs may be used in many 
domains. 

In the domain of insurance, for example, the items of interest may be individuals that are 
partitioned into groups based upon certain characteristics such as age, gender, etc. Categories of 
auto insurance policies (for example, categorized based upon the limits/deductible amount for 
liability coverage and/or collision coverage) attributed to the groups of individuals may represent 
the factors of the design. The descriptor vectors associated with such individuals may represent 
risk of the individual (for example, may be a dollar amount of the moneys paid to the individual 
arising from auto insurance claims during a predetermined period of time). In this example, the 
computer implemented method discussed above may be used to provide a mapping of the 
components of the descriptor vectors to a space that optimally discriminates between groups of 
individuals, or optimally discriminates between the policies, or optimally discriminates 
interaction between the groups and policies. 

In the domain of agriculture, for example, the items of interest may be plant species that 
are partitioned into groups based upon certain characteristics such as genetic makeup of the plant 
species. Categories of fertilizers (or pesticides) that are applied to the groups of plant species 
may represent the factors of the design. The descriptor vectors associated with such plant species 
may represent a characteristic of the plant species such as yield and/or drought resistance. In this 
example, the computer implemented method discussed above may be used to provide a mapping 
of the components of the descriptor vectors to a space that optimally discriminates between groups 
of plant species, or optimally discriminates between the fertilizers (or pesticides), or optimally 
discriminates interaction between the groups and fertilizers (or pesticides). 

It should be noted that the information encoded by the descriptor vectors may cause 
membership in a group associated with an item of interest. For example, the behavior of a 
molecular complex, which is encoded by a descriptor vector associated with the molecular 
complex, may cause membership in a group (or category) of molecular complexes (for example, 
hydrophilic, polar, "active" with respect to a class of reactions, etc.). In the alternative, the 
information encoded by the descriptor vector associated with an item of interest may be a 
response to group membership. For example, the risks associated with an insurance policy may 
be encoded by a descriptor vector associated with the insurance policy, whereby the risks are 
dictated by the group (i.e., type) membership of the policy. 

In addition, the present invention may be used to identify molecules within a large 
database of molecules that are similar to a query molecule. The methodology may be 
conceptually divided into three phases: a training phase, an acquisition phase, and a recognition 
phase. 

In the training phase, the following steps are performed for each molecule in a series of 
molecules (denoted the training set). Data that represents the atomic structure of the given 
molecule in the training set is stored in persistent memory. In addition, data that represents one 
or more properties of the given molecule is stored in persistent memory. The data that represents 
the atomic structure of the given molecule is read from persistent memory and used to define a 
set of three-dimensional regions of space that contain portions of the given molecule. For each 
region of the given molecule, one or more properties of the given molecule are generated and 
mapped to one or more property values for the grid points of the region; and the property values 
for the grid points of the region are used to determine a descriptor vector associated with the 
region that characterizes the region. Preferably, one or more components of the descriptor vector 
associated with a given region represent position and/or orientation of axes derived from the 
property distribution of the region, wherein the axes are invariant with respect to translation and 
rotation of the region. In addition, a group identifier is assigned to the region. The group 
identifier identifies the group to which the region belongs. For example, the regions may be 
partitioned into groups based upon the charge distribution of the region (e.g., charged, neutral, 
polar, non-polar, etc.) or other property or behavior of the region. A mapping of one or more 
components of the descriptor vector for the regions is then calculated (preferably by applying the 
method discussed above to generate a mapping of components of the descriptor vectors to a space 
that optimally discriminates between the groups of regions - step 217 of FIG. 2). The mapping 
generated in the training phase is stored in persistent storage and used in the acquisition phase 
and recognition phase. A more detailed description of the training phase for a particular region is 
illustrated in FIG. 5. 

In the acquisition phase, for each molecule in a series of molecules (which is typically 
distinct from the series of molecules processed in the training phase), the following steps are 
performed for a given molecule. Similar to the training phase, data that represents the atomic 
structure of the given molecule is used to define a set of regions that contain portions of the 
given molecule, and descriptor vectors are generated for such regions of the given molecule. 
Similar to the training phase, one or more components of the descriptor vector associated with a 
given region preferably represent position and/or orientation of axes derived from the property 
distribution of the region, wherein the axes are invariant with respect to translation and rotation 
of the region. The mapping generated in the training phase is used to map one or more 
components of the descriptor vector for each given region (preferably to a space that optimally 
discriminates between groups of regions), and a key is generated based upon the mapping of the 
component(s) of the associated descriptor vector. The key identifies an entry in a multi-map 
(described below) stored in persistent memory. Data identifying the given region, data (or a 
pointer to such data) characterizing a set of axes derived from the property distribution of the 
region, and preferably other data (for example, data identifying the molecule to which the region 
belongs, and data representing the geometric center of the molecule to which the region belongs) 
are then stored in the multi-map at a location identified by the key. As described below in more 
detail, the data characterizing a set of axes derived from the property distribution of a given 
region preferably characterizes transformation between an input reference frame and the inertial 
reference frame for the given region. A more detailed description of an exemplary embodiment 
of the acquisition phase for a particular region is illustrated in FIG. 6. Note that if there are 
common molecules (or regions) in the series of molecules processed in the training phase and 
acquisition phase, data generated in the training phase may be used in the acquisition phase, and 
the operations that use and/or generate such data may be bypassed accordingly. 

In the recognition phase, a query molecule is provided. Similar to the training and 
acquisition phases, data that represents the atomic structure of the query molecule is used to 
define a set of regions that contain portions of the query molecule, and descriptor vectors are 
generated for such regions of the query molecule. Similar to the training phase, one or more 
components of the descriptor vector associated with a given region preferably represent position 
and/or orientation of axes derived from the property distribution of the region, wherein the axes 
are invariant with respect to translation and rotation of the region. For each query region, the 
mapping generated in the training phase is used to map one or more components of the descriptor 
vector for the given query region (preferably to a space that optimally discriminates between 
groups of regions), and a key is generated based upon the mapping of the component(s) of the 
associated descriptor vector. For each key, a multi-map entry identified by the key is retrieved 
and data stored therein [i.e., data identifying one or more regions associated therewith, data 
characterizing a set of axes derived from the property distribution of each region associated 
therewith, data identifying the molecule to which each region associated therewith belongs, and 
data representing the geometric center of the molecule to which each region associated therewith 
belongs] are read from the table. For each region identified by the retrieved table entry, a 
hypothesized match is constructed and added to a vote table. After processing all of the regions 
identified by the retrieved table entry, the vote table is sorted to determine a set of potential 
matching regions, and the set of potential matching regions (and/or the molecules to which the 
set of potential matching regions belong) is made accessible to the user via an I/O device. A 
more detailed description of an exemplary embodiment of the recognition phase for a region of 
the query molecule is illustrated in FIGS. 7(A) and (B). 
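
The lookup-and-vote flow of the recognition phase can be sketched as follows. This is a simplified illustration under stated assumptions: the keys are hypothetical tuples, each retrieved record counts as one hypothesized match, and votes are tallied per molecule. The construction of a hypothesized match from the stored axes is omitted, and the record fields are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical multi-map filled during the acquisition phase:
# key -> list of records, so several regions may share one key.
multimap = defaultdict(list)
multimap[(3, 1)].append({"region": "r1", "molecule": "mol_A"})
multimap[(3, 1)].append({"region": "r4", "molecule": "mol_B"})
multimap[(0, 2)].append({"region": "r2", "molecule": "mol_A"})

def recognize(query_keys, table):
    """Tally one vote per retrieved record and sort the vote table."""
    votes = Counter()
    for key in query_keys:
        # .get avoids creating empty entries for keys that were never stored.
        for record in table.get(key, []):
            votes[record["molecule"]] += 1
    return votes.most_common()   # highest vote count first

# Keys generated for the regions of a hypothetical query molecule.
ranked = recognize([(3, 1), (0, 2), (7, 7)], multimap)
```

Here `ranked` lists candidate molecules in order of decreasing vote count; a key with no stored entry, such as `(7, 7)`, simply contributes no votes.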

Training Phase 

In the training phase, for each molecule in a training set, data that represents the structure 
of the given molecule is stored in persistent memory. Preferably, the data represents the atomic 
structure of the given molecule in an arbitrary three-dimensional reference frame, which is 
referred to below as the "input reference frame". The data may be obtained from a database 
(public or private) or be derived by traditional molecular modeling techniques. 

In addition, for each molecule in the training set, data that represents the property 
distribution of one or more properties of the given molecule is generated, for example by reading 
such data from persistent memory. The property distribution for the property of a given molecule 
may be an atomic property distribution (i.e., data representing a heuristic, such as 
electronegativity, hydrophobicity or polarity, that characterizes the behavior or property of an 
atom in the given molecule), a surface property distribution (i.e., data representing a heuristic that 
characterizes the behavior or property of a surface on or within the molecule), or a volumetric 
property distribution (i.e., data characterizing the behavior or property of a region of the 
molecule, for example, electron density). 

In addition, for each given molecule in the training set, the structure data is used to 
generate a set of three-dimensional regions of space, hereinafter referred to as "scoops", in the input 
reference frame. Preferably, each scoop is a spherical region in the input reference frame having 
a center (i.e., a point in the input reference frame) and a radius. Preferably, the center of each 
scoop corresponds to one or more heuristics of the given molecule. For example, the heuristic 
may correspond to the coordinate of i) the nucleus of one or more atoms of the given molecule, 
ii) the center of one or more bonds between atoms of the given molecule, iii) the end point of an 
extension to a bond (typically the length of the extension is a factor of bond length) between 
atoms of the given molecule, iv) grid points in the given molecule, v) the geometric center of 
the molecule, vi) ring centers in the given molecule, or vii) lone pairs of electrons in the given 
molecule. In addition, the radius of each scoop may be set to a predetermined value (for 
example, 3 angstroms, or 5 angstroms) or may be based upon heuristics of the given molecule. 
It should be noted that a scoop can be any arbitrary three-dimensional region in the input 
reference frame. 
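
As an illustration of heuristic i) above, the following sketch builds one spherical scoop per atom nucleus with a fixed radius and records which atoms fall inside it. The `Scoop` class, the function name, and the coordinates are hypothetical and serve only to make the geometry concrete.

```python
import math
from dataclasses import dataclass

@dataclass
class Scoop:
    center: tuple        # (x, y, z) in the input reference frame
    radius: float        # in angstroms
    atom_indices: list   # indices of atoms whose nuclei lie inside the sphere

def atom_centered_scoops(coords, radius=3.0):
    """One spherical scoop centered on each atom nucleus."""
    scoops = []
    for center in coords:
        inside = [k for k, p in enumerate(coords) if math.dist(center, p) <= radius]
        scoops.append(Scoop(center=tuple(center), radius=radius, atom_indices=inside))
    return scoops

# Hypothetical atom coordinates (angstroms) along a line.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
scoops = atom_centered_scoops(coords, radius=3.0)
```

Other center heuristics (bond centers, ring centers, and so on) would change only how the list of centers is produced; the membership test against the radius stays the same.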

As illustrated in FIG. 5, the training phase preferably performs a nested loop over the 
molecules in the training set, wherein, for a given molecule, each scoop in the set of scoops for 
the given molecule is processed as follows. 

In step 501, the property distribution of one or more properties of the given molecule that 
is relevant to the scoop is identified, and the relevant property distribution is mapped to a 
property field that represents the value of the property at points in the input reference frame. 
There are many well-known techniques to map a property distribution (atomic, surface, 
volumetric) to a property field, the specifics of which are not relevant to the present invention. 
Examples of such techniques may be found in U.S. Patent 5,025,388 to Cramer, III et al., 
hereinafter incorporated by reference in its entirety. With regard to atomic properties for the 
given molecule, preferably the properties of atoms that are contained within the scoop are 
identified as relevant properties for the scoop in step 501. With regard to surface properties for 
the given molecule, preferably the properties of surfaces that are contained (partially or fully) 
within the scoop are identified as relevant properties for the scoop in step 501. With regard to 
volumetric properties for the given molecule, preferably the properties of volumes that are 
contained (partially or fully) within the scoop are identified as relevant properties for the scoop in 
step 501. 

In step 503, the property field generated in step 501 is mapped to grid points contained in 
the scoop to determine contribution of property field at the grid points. Preferably, the grid 
points are evenly spaced within the given scoop. In addition, the contribution of the property 
field at a given grid point preferably includes two values: the first value is a positive value, 
which for the sake of description is referred to below as a mass value; the second value is a real 
value, which for the sake of description is referred to below as a charge value. There are many 
possible techniques to calculate the contribution of a property field at a given grid point, the 
specifics of which are not relevant to the present invention. For example, the following "smearing 
function" can be used in step 503 to map values of the points of the property field to a first value 
$\mu_i$ (mass value) at grid point $i$: 

$$\mu_i = \sum_{j} A_j \, e^{[\ldots]}$$ 

where $A_j$ represents the value of the property field at a point $j$ in the 
property field; 

$R_{ij}$ represents the distance between the point $j$ in the property field 
and the point $i$ in the scoop; 

the damping factor typically ranges between 0.5 and 2.5; 

the arbitrary constant typically is 1; and 

$\Sigma$ sums over all the points $j$ in the property field for the scoop. 

Alternate smearing functions may be gaussian forms. In addition, the following equation may be 
used to map values of the points of the property field to a second value Xi (charge value) at grid 
point $i$: 

where $\bar{\mu}$ represents the mean first value (mass value) for all the grid 
points of the scoop 
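
The following sketch shows how mass and charge values might be computed at grid points. It rests on two loudly stated assumptions that are not taken from the text: the smearing kernel is taken to be a simple exponential decay, exp(-R_ij / damping), and the charge value is taken to be the mass value minus the mean mass value over the scoop. Function names and sample values are hypothetical.

```python
import math

def mass_values(grid_points, field_points, field_values, damping=1.0):
    """First value (mass) at each grid point.

    ASSUMPTION: the kernel exp(-R_ij / damping) is illustrative only; the
    exact exponent of the smearing function is not reproduced here.
    """
    masses = []
    for g in grid_points:
        mu = sum(a * math.exp(-math.dist(g, p) / damping)
                 for p, a in zip(field_points, field_values))
        masses.append(mu)
    return masses

def charge_values(masses):
    """Second value (charge) at each grid point.

    ASSUMPTION: charge is the mass value minus the mean mass value, which
    yields real values (positive or negative) that sum to zero.
    """
    mean_mass = sum(masses) / len(masses)
    return [m - mean_mass for m in masses]

# One property-field point of value 2.0 at the origin; two grid points.
grid = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
masses = mass_values(grid, field_points=[(0.0, 0.0, 0.0)], field_values=[2.0])
charges = charge_values(masses)
```

With positive property-field values the mass values stay positive, matching the requirement that the first value be positive, while the charge values may take either sign.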

In step 505, a descriptor vector for the scoop is generated based upon the property values 
of the grid points of the scoop. Preferably, the components of the descriptor vector for the scoop 
include one or more of the following data values, details of which are set forth below: 

ID              data value identifying the molecule from which the scoop is 
                derived 

N               an integer value representing the number of grid points for the 
                scoop 

M               total mass for the scoop, which is the sum of the first values (i.e., 
                "mass values" or $\mu_i$) for the grid points of the scoop 

Q               total charge for the scoop, which is the sum of the second values 
                (i.e., "charge values" or Xi) for the grid points of the scoop 

Sx, Sy, Sz      components of a vector describing translation between the origin of 
                the input reference frame and the center of mass in the input reference 
                frame 

Ix, Iy, Iz      principal values of the moment of inertia tensor of the scoop with 
                respect to the center of mass of the scoop 

$S_\theta$, $S_\phi$, $S_\alpha$   $S_\theta$ and $S_\phi$ are polar angles that describe an axis of rotation. 
                $S_\alpha$ represents an angle of rotation about that axis. 
                Together, the angles represent a rotation transformation between 
                principal axes of the moment of inertia tensor and axes of the input 
                reference frame 

Vx, Vy, Vz      components of a vector describing translation between the center 
                of the scoop in the inertial reference frame and the center of mass in the 
                inertial reference frame 

dx, dy, dz      components of a vector describing translation between the center of 
                mass in the inertial reference frame and the center of dipole in the 
                inertial reference frame 

Cix, Ciy, Ciz   components of a vector describing third order moment of mass 
                about a center of expansion in the internal reference frame 

Cqx, Cqy, Cqz   components of a vector describing third order moment of charge 
                about a center of expansion in the internal reference frame 

Qxx, Qxy, Qxz,  components of a tensor characterizing quadrupolar moment about a 
Qyy, Qyz, Qzz   center of expansion in the internal reference frame 

Note that the components ID, N, M, Q, Ix, Iy, Iz, Vx, Vy, Vz, dx, dy, dz, Cix, Ciy, Ciz, 
Cqx, Cqy, Cqz, Qxx, Qxy, Qxz, Qyy, Qyz, Qzz characterize the property distribution with 
respect to invariant axes (i.e., axes that are invariant with respect to translation and rotation of the 
given scoop, for example the sensed inertial axes of the given scoop). Also note that the 
components Sx, Sy, Sz, $S_\theta$, $S_\phi$, $S_\alpha$ are variant with respect to translation and rotation of the 
scoop, and characterize orientation of the invariant axes with respect to the input reference frame 
of the scoop. A more detailed description of the calculation of the components of the descriptor 
vector for a given scoop is set forth below. 
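
A few of the components listed above can be computed directly from the grid-point values. The sketch below is a minimal illustration using the standard definitions of center of mass and moment of inertia tensor; it covers only M, Q, the center of mass, and the principal values Ix, Iy, Iz. The function name and the sample data are hypothetical.

```python
import numpy as np

def scoop_moments(points, mass, charge):
    """Total mass, total charge, center of mass, and principal moments of inertia."""
    points = np.asarray(points, dtype=float)
    mass = np.asarray(mass, dtype=float)
    charge = np.asarray(charge, dtype=float)
    M = mass.sum()
    Q = charge.sum()
    center_of_mass = (mass[:, None] * points).sum(axis=0) / M
    r = points - center_of_mass
    r2 = (r ** 2).sum(axis=1)
    # Inertia tensor about the center of mass: sum_k m_k (|r_k|^2 I - r_k r_k^T).
    inertia = (mass * r2).sum() * np.eye(3) - np.einsum("k,ka,kb->ab", mass, r, r)
    principal = np.linalg.eigvalsh(inertia)   # ascending principal values
    return M, Q, center_of_mass, principal

# Hypothetical grid points with mass and charge values.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
mass = [1.0, 2.0, 1.0, 1.0]
charge = [0.5, -0.5, 0.2, -0.2]
M, Q, com, principal = scoop_moments(pts, mass, charge)

# Rotate about z and translate: the center of mass changes, the principal values do not.
t = 0.7
R = np.array([[np.cos(t), -np.sin(t), 0.0], [np.sin(t), np.cos(t), 0.0], [0.0, 0.0, 1.0]])
moved = pts @ R.T + np.array([5.0, -2.0, 1.0])
M2, Q2, com_moved, principal_moved = scoop_moments(moved, mass, charge)
```

The comparison at the end illustrates the distinction drawn in the text: the principal values are unchanged by translation and rotation, while the center-of-mass vector depends on the input reference frame.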

In step 507, a group identifier is assigned to the scoop. The group identifier identifies the 
group to which the region belongs. For example, the scoops may be partitioned into groups 
based upon the charge distribution of the scoop (e.g., charged, neutral, polar, non-polar, etc.) or 
other property or behavior of the scoops. 

In step 509, statistical analysis is used to generate a mapping of one or more components 
of the descriptor vector for the scoop. Preferably, the computer implemented method of the 
present invention discussed above is used to generate a mapping of the components of the 
descriptor vector for the scoop that characterize position and/or orientation of invariant axes 
derived from the property distribution of the given scoop (i.e., ID, N, M, Q, Ix, Iy, Iz, Vx, Vy, 
Vz, dx, dy, dz, Cix, Ciy, Ciz, Cqx, Cqy, Cqz, Qxx, Qxy, Qxz, Qyy, Qyz, Qzz) to a 
space that optimally discriminates between the groups of scoops (described above with respect to 
step 217 of FIG. 2). As described above with respect to step 217 of FIG. 2, the mapping is based 
upon the transpose of unit vectors for the component vectors $\vec{w}_1, \vec{w}_2, \ldots$. The mapping 
generated in step 509 is preferably stored in persistent storage for subsequent use in the 
acquisition phase and recognition phase. 

Acquisition Phase 



In the acquisition phase, for each molecule in a series of molecules, data that represents 
the structure of the given molecule is stored in persistent memory. Preferably, the data 
represents the atomic structure of the given molecule in an arbitrary three-dimensional reference 
frame, which is referred to below as the "input reference frame". The data may be obtained from 
a database (public or private) or be derived by traditional molecular modeling techniques. 

In addition, for each molecule in a series of molecules, data that represents the property 
distribution of one or more properties of the given molecule is generated, for example, by 
reading such data from persistent memory. The property distribution for a property of the given 
molecule may be an atomic property distribution, a surface property distribution, or a volumetric 
property distribution as described above. 

In addition, for each given molecule in the series of molecules, the structure data is used 
to define a set of scoops in the input reference frame. Preferably, each scoop is a spherical region 
in the input reference frame having a center and radius as described above. 

As illustrated in FIG. 6, the acquisition phase preferably performs a nested loop over the 
series of molecules, wherein, for a given molecule, each scoop in the set of scoops for the given 
molecule is processed as follows. 

In step 601, properties of the given molecule that are relevant to the scoop are identified, 
and the relevant properties are mapped to a property field that represents the value of the property 
at points in the input reference frame. This operation is similar to the processing described above 
with respect to step 501 of the training phase. 

In step 603, the property field generated in step 601 is mapped to grid points contained in 
the scoop to determine contribution of property field at the grid points. Preferably, the 
contribution of property field at a given grid point includes two values: a positive mass value 
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and a real charge value. This operation is similar to the processing described above with respect 
to step 503 of the training phase. 

In step 605, a descriptor vector for the scoop is generated based upon the property values 
of the grid points of the scoop generated in step 603. Preferably, the components of the 
descriptor vector for the scoop include one or more of the data values as described above with 
respect to step 505 of the training phase. 

In step 607, the mapping generated in step 509 of the training phase is used to map one or 
more components of the descriptor vector for the scoop (preferably, the component(s) of the 
descriptor vector are mapped to a space that optimally discriminates between the groups of scoops 
as described above with respect to step 217 of FIG. 2). 
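
By way of illustration only, the mapping of step 607 may be pictured as a projection of the descriptor vector onto a set of stored component vectors. The Python sketch below assumes that form; the function name `map_descriptor`, the list representation, and the sample numbers are hypothetical and are not taken from the disclosure.

```python
import math

def map_descriptor(descriptor, component_vectors):
    """Project a descriptor vector onto each normalized component vector.

    Returns one coordinate per component vector, i.e. the position of the
    scoop in the lower-dimensional space defined by those vectors.
    """
    mapped = []
    for w in component_vectors:
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            raise ValueError("component vector must be non-zero")
        # Dot product of the descriptor with the unit vector along w.
        mapped.append(sum(d * c for d, c in zip(descriptor, w)) / norm)
    return mapped

# Hypothetical three-component descriptor and two component vectors.
mapped = map_descriptor([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [0.0, 3.0, 4.0]])
```

In this sketch each mapped coordinate is independent of the length of the component vector, so only its direction matters.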

In step 609, a key is generated based upon the mapping of the component(s) of the 
descriptor vector for the scoop in step 607. The key identifies an entry in a multi-map stored in 
persistent memory. The multi-map is an associative memory which permits more than one entry 
stored in the memory to be associated with the same key. A detailed description of a multi-map 
is set forth in D.R. Musser and Atul Saini, STL Tutorial and Reference Guide: C++ 
Programming with the Standard Template Library (Addison-Wesley 1996), herein incorporated 
by reference in its entirety. Preferably, the multi-map is formed from a hash table. A more 
detailed description of a hash table may be found in R. Sedgewick, Algorithms in C++ 
(Addison-Wesley 1992), herein incorporated by reference in its entirety. In the alternative, the 
multi-map container may be formed from a linked list data structure, or a tree structure such as 
an AVL-tree or B* tree as described in D.R. Musser and Atul Saini, herein incorporated by 
reference in its entirety. One skilled in the art will realize that there are many possible 
underlying implementations for the multi-map data structure. 
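
As one possible illustration of steps 609 and 611, the following Python sketch builds a key from the mapped components and stores entries in a hash-based multi-map. The quantization scheme (`bin_width`), the function names, and the string entries are assumptions made for the example; the disclosure does not prescribe them.

```python
import math
from collections import defaultdict

def make_key(mapped_components, bin_width=0.5):
    """Quantize mapped components into a hashable tuple (assumed key scheme)."""
    return tuple(math.floor(c / bin_width) for c in mapped_components)

def store_entry(multi_map, mapped_components, entry, bin_width=0.5):
    """Append an entry under its key; several entries may share one key."""
    key = make_key(mapped_components, bin_width)
    multi_map[key].append(entry)
    return key

# A dict of lists behaves as a multi-map backed by a hash table.
multi_map = defaultdict(list)
key_a = store_entry(multi_map, [1.0, 3.6], "scoop-A")
key_b = store_entry(multi_map, [1.2, 3.7], "scoop-B")
```

Here two scoops whose mapped components fall in the same bins share a key, which is the property the recognition phase relies on when it looks up candidates.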

Finally, in step 611, data identifying the given scoop, data (or a pointer to such data) 
characterizing a set of axes derived from the property distribution of the scoop, and preferably 
other data (for example, data identifying the molecule to which the scoop belongs, and data 
representing the geometric center in the input reference frame of the molecule to which the scoop 
belongs) are then stored in the multi-map at a location identified by the key generated in step 
609. Preferably, the data characterizing a set of axes derived from the property distribution of a 
given scoop characterizes the transformation between the input reference frame and the inertial 
reference frame for the given scoop, and may include the components Sx, Sy, Sz, Sθ, Sφ, SΩ of the 
descriptor vector for the scoop as discussed above with respect to step 605. 

An exemplary multi-map entry is illustrated in FIGS. 7(A) and (B). The entry 701 
includes a series of segments 703-1 and 703-2 (two shown) coupled via a linked-list data structure. 
Each segment includes a first ID field 711 that stores an identifier for a scoop, a pointer 713 to 
data (or the data itself) characterizing a set of axes derived from the property distribution of the 
scoop, a second ID field 715 identifying the molecule to which the scoop belongs, a pointer 717 
to data (or the data itself) representing the geometric center of the molecule to which the scoop 
belongs, and a pointer 719 to the next segment in the table entry. 
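
The segment layout described above may be sketched, purely as an example, with a small linked record type. The class name `Segment`, the field names, and the sample values below are hypothetical; only the correspondence to fields 711 through 719 follows the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Segment:
    scoop_id: int                                 # first ID field 711
    axes: Tuple[float, ...]                       # data referenced by pointer 713
    molecule_id: int                              # second ID field 715
    molecule_center: Tuple[float, float, float]   # data referenced by pointer 717
    next: Optional["Segment"] = None              # pointer 719 to the next segment

def iter_segments(head):
    """Walk the linked list of segments that makes up one multi-map entry."""
    while head is not None:
        yield head
        head = head.next

# Two segments of one entry, both belonging to hypothetical molecule 17.
second = Segment(2, (0.0,) * 6, 17, (0.0, 0.0, 0.0))
first = Segment(1, (0.0,) * 6, 17, (0.0, 0.0, 0.0), next=second)
scoop_ids = [segment.scoop_id for segment in iter_segments(first)]
```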

The operations of steps 601-611 are applied for each scoop in a given molecule. At the 
end of the acquisition phase, the multi-map stores entries each corresponding to one or more 
scoops for the series of molecules being studied. 

Recognition Phase 

FIGS. 8(A) and (B) illustrate the recognition phase for a query molecule. In step 801, 
data that represents the structure of the query molecule in the input reference frame is stored in 
persistent memory. In step 803, data that represents the property distribution of one or more 
properties of the query molecule is generated, for example by reading such data from persistent 
memory. The property distribution of the property for a given molecule may be an atomic 
property distribution, a surface property distribution, or a volumetric property distribution as 
described above. In step 805, the structure data of the query molecule is used to define a set of 
scoops (in the input reference frame) in the query molecule. In steps 807-825, a loop is 
performed over the set of scoops in the query molecule wherein each scoop is processed as 
follows. 

In step 809, the properties of the query molecule that are relevant to the scoop are 
identified, and the relevant properties are mapped to a property field that represents the value of 
the property at points in the input reference frame of the query molecule. This operation is 
similar to the processing described above with respect to step 501 of the training phase. 

In step 811, the property field generated in step 809 is mapped to grid points contained in 
the scoop to determine contribution of property field at the grid points. Preferably, the 
contribution of property field at a given grid point includes two property values as described 
above: a positive mass value and a real charge value. This operation is similar to the processing 
described above with respect to step 503 of the training phase. 

In step 813, a descriptor vector for the scoop is generated based upon the property values 
(mass values and charge values) of the grid points of the scoop generated in step 811. 
Preferably, the components of the descriptor vector for the scoop include one or more of the 
data values as described above with respect to step 505 of the training phase. 

In step 815, the mapping generated in step 509 of the training phase is used to map one 
or more components of the descriptor vector for the scoop generated in step 813 (preferably, the 
component(s) of the descriptor vector are mapped to a space that optimally discriminates 
between the groups of scoops as described above with respect to step 217 of FIG. 2). This 
operation is similar to the processing described above with respect to step 607 of the acquisition 
phase. 

In step 817, a key is generated based upon the mapping of the component(s) of the 
descriptor vector generated in step 815. The key identifies an entry in the multi-map stored in 
persistent memory. This operation is similar to the processing described above with respect to 
step 609 of the acquisition phase. 

In step 819, the multi-map entry identified by the key generated in step 817 is retrieved 
and the data stored therein are read from the retrieved multi-map entry. 

In step 821, for each scoop identified by the retrieved multi-map entry (i.e., each entry 
segment in FIG. 7), an hypothesized match is constructed to determine a set of transformation 
parameters whereby a set of axes derived from property distribution of the query scoop is aligned 
with a set of axes derived from property distribution for each molecular scoop identified by the 
retrieved multi-map entry, which for the sake of description is referred to below as the stored 
scoop. 

In step 823, a label corresponding to transformation parameters of the hypothesized 
match is generated, and the hypothesized match is added to a vote table. The vote table is an 
associative memory of entries keyed by the label. Each entry of the vote table stores i) an 
accumulated score of the number of stored scoops whose hypothesized match corresponds to the 
label, and ii) data identifying those stored scoops whose hypothesized match corresponds to the 
label. Preferably, the vote table is implemented as a map. A detailed description of a map is set 
forth in D.R. Musser and Atul Saini, STL Tutorial and Reference Guide: C++ Programming with 
the Standard Template Library (Addison-Wesley 1996), incorporated by reference above in its 
entirety. The map may be formed from a hash table. A more detailed description of a hash table 
may be found in R. Sedgewick, Algorithms in C++ (Addison-Wesley 1992), incorporated by 
reference above in its entirety. In the alternative, the map may be formed from a linked list data 
structure, or a tree structure such as an AVL-tree or B* tree as described in D.R. Musser and 
Atul Saini. One skilled in the art will realize that there are many possible underlying 
implementations for the map data structure. 

In step 827, after processing all of the stored scoops identified by the retrieved multi-map 
entry (step 819), the vote table is sorted to determine a set of potential matching scoops. For 
example, one or more entries with the highest score may be selected, and those scoops identified 
by the selected entries of the vote table may be selected as the set of potential matching scoops 
for the query molecule. 
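
The selection of step 827 can be sketched as a sort of the vote table by accumulated score. In the Python example below, the dictionary layout of a vote table entry (`"score"`, `"scoops"`), the label tuples, and the function name `top_matches` are assumptions made for illustration.

```python
def top_matches(vote_table, n=1):
    """Return (label, scoops) for the n entries with the highest score."""
    ranked = sorted(vote_table.items(),
                    key=lambda item: item[1]["score"],
                    reverse=True)
    return [(label, entry["scoops"]) for label, entry in ranked[:n]]

# Hypothetical vote table: label -> accumulated score and voting scoops.
vote_table = {
    ("mol-7", 3, 12): {"score": 4, "scoops": [11, 14, 15, 19]},
    ("mol-2", 0, 5): {"score": 1, "scoops": [42]},
    ("mol-9", 8, 1): {"score": 2, "scoops": [3, 8]},
}
best = top_matches(vote_table, n=2)
```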

Finally, in step 829, the set of potential matching scoops (and/or the molecules to which 
the set of potential matching scoops belong) is made accessible to the user, for example, via 
graphical user interface 907 of the computer processing system. 

In the preferred embodiment of the present invention, the operations of the recognition 
phase in constructing the hypothesized match, adding the hypothesized match to the vote table, 
and sorting the vote table to report results to the user are illustrated in the flow chart of FIGS. 
9(A) and (B). As described above, each entry of the multi-map generated in step 611 of the 
acquisition phase preferably stores: i) data characterizing transformation between the input 
reference frame and the sensed inertial reference frame for a given stored scoop, which may 
include the components Sx, Sy, Sz, Sθ, Sφ, SΩ of the descriptor vector for the stored scoop; ii) 
data identifying the molecule to which the stored scoop belongs, denoted the stored molecule; 
and iii) data representing the geometric center in the input reference frame of the stored 
molecule. Similar operations are performed in the recognition phase in constructing the 
hypothesized match as illustrated in the flow chart of FIGS. 9(A) and (B). 

More specifically, in step 901, data characterizing transformation between the input 
reference frame and the sensed inertial reference frame for the query scoop, which may include 
the components Sx, Sy, Sz, Sθ, Sφ, SΩ of the descriptor vector for the query scoop, is 
generated. 

In step 903, data characterizing transformation between the input reference frame for the 
stored molecule (denoted "stored frame") and the sensed inertial reference frame of the stored 
scoop (û1, û2, û3 of the stored scoop as described below) and data characterizing 
transformation between the input reference frame of the query molecule (denoted "query frame") 
and the sensed inertial reference frame of the query scoop (û1, û2, û3 of the query scoop as 
described below) are used to calculate a transformation (a translation and rotation 
transformation) that aligns the two sensed inertial frames, which also represents a transformation 
from the stored frame to the query frame. Details of an exemplary technique for calculating this 
transformation may be found in K. Turkowski, "Graphics Gems," Academic Press, edited by A. 
Glassner, pgs. 522-532, herein incorporated by reference in its entirety. 
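
One way to picture the alignment of step 903 is sketched below in Python. It assumes that each sensed frame is supplied as a 3x3 matrix whose rows are the sensed unit axes, together with the scoop's center of mass as origin; the function names and the sample frames are hypothetical, and the composition shown is a standard change of frames rather than necessarily the technique of the cited reference.

```python
def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat_vec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def align_frames(stored_axes, stored_origin, query_axes, query_origin):
    """Return (rotation, translation) with x_query = rotation * x_stored + translation.

    A point is carried from the stored frame into the shared inertial frame
    and from there into the query frame.
    """
    rotation = mat_mul(transpose(query_axes), stored_axes)
    rotated_origin = mat_vec(rotation, stored_origin)
    translation = [q - r for q, r in zip(query_origin, rotated_origin)]
    return rotation, translation

def apply_transform(rotation, translation, point):
    return [r + t for r, t in zip(mat_vec(rotation, point), translation)]

# Stored scoop: axes along x, y, z; query scoop: same axes turned 90 degrees about z.
stored_axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_axes = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
rotation, translation = align_frames(stored_axes, [1.0, 0.0, 0.0],
                                     query_axes, [0.0, 2.0, 0.0])
moved = apply_transform(rotation, translation, [2.0, 0.0, 0.0])
```

The same `apply_transform` helper corresponds to the operation of carrying the geometric center of the stored molecule into the query frame.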

In step 905, the alignment transformation generated in step 903 is applied to the data 
representing the geometric center of the stored molecule, which is preferably retrieved from data 
included in the matching entry of the multi-map, to generate data representing center of the 
stored molecule in the query frame. 

In step 907, it is determined if the center of the stored molecule in the query frame lies 
within the volume of the query molecule. The volume of the query molecule may be calculated 
utilizing various well-known techniques, such as the techniques described in Connolly, M.L., 
"Computation of Molecular Volume," JACS, Vol. 107, 1985, pgs. 1118-1124, herein 
incorporated by reference in its entirety. 
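
For illustration, the containment test of step 907 may be approximated by treating the query molecule as a union of atom-centered spheres. This is a simpler stand-in for the cited volume computation, not that method itself; the function name `inside_volume`, the radii, and the coordinates below are hypothetical.

```python
def inside_volume(point, atoms):
    """Return True if point lies inside at least one atom sphere.

    atoms is a sequence of (center, radius) pairs in the query frame.
    """
    for center, radius in atoms:
        squared_distance = sum((p - c) ** 2 for p, c in zip(point, center))
        if squared_distance <= radius * radius:
            return True
    return False

# Two hypothetical atoms of the query molecule.
query_atoms = [((0.0, 0.0, 0.0), 1.7), ((1.5, 0.0, 0.0), 1.2)]
center_inside = inside_volume((2.5, 0.0, 0.0), query_atoms)
center_outside = inside_volume((4.0, 0.0, 0.0), query_atoms)
```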

If the test of step 907 fails (the center of the stored molecule in the query frame lies 
outside the volume of the query molecule), in step 909 the construction of the match hypothesis 
ends without adding an entry to the vote table, and processing continues to step 909. 

However, if the comparison of step 907 is successful, in step 911 data representing the 
rotation of the alignment transformation generated in step 903 and data representing center of the 
stored molecule in the query frame are preferably quantized to form an integer pair, and a label is 
generated based upon such data and the identifier of the stored molecule. 
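
The quantization of step 911 may be sketched as follows. The bin sizes, the packing of several bin indices into one integer, and the names `pack` and `make_label` are assumptions chosen for the example; the text only requires that rotation and center data be quantized into an integer pair and combined with the molecule identifier.

```python
import math

def pack(values, step, bins=1024):
    """Quantize each value to a bin index and pack the indices into one integer."""
    code = 0
    for value in values:
        index = math.floor(value / step) % bins   # wrap the index into [0, bins)
        code = code * bins + index
    return code

def make_label(molecule_id, rotation_angles, center,
               angle_step=0.2, length_step=1.0):
    """Label = molecule identifier plus the (rotation, center) integer pair."""
    return (molecule_id,
            pack(rotation_angles, angle_step),
            pack(center, length_step))

# Two nearly identical hypothesized matches and one clearly different match.
label_a = make_label(7, (0.10, 1.05, 2.05), (3.2, -1.4, 0.6))
label_b = make_label(7, (0.15, 1.10, 2.10), (3.7, -1.1, 0.9))
label_c = make_label(7, (0.10, 1.05, 2.05), (5.2, -1.4, 0.6))
```

Hypothesized matches that agree to within one bin produce the same label and therefore vote for the same entry.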

In step 913, if the vote table does not include an entry corresponding to the label 
generated in step 911, a new entry associated with the corresponding label is added to the vote 
table. The new entry includes a score field with an initial value (for example, 1), data 
identifying the given stored scoop, and data identifying the alignment transformation generated in 
step 903. Otherwise (the vote table already includes an entry corresponding to the label 
generated in step 911), the score field of the corresponding entry is incremented and possibly 
data identifying the stored scoop is added to the entry. In addition, the alignment transformation 
data may be updated, for example, to represent the cumulative average alignment transformation. 
The operation then continues to step 909. 
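
A minimal sketch of the add-or-increment logic of step 913 follows. The entry layout, the function name `cast_vote`, and the component-wise running average of the transformation parameters are assumptions; averaging rotation parameters component by component is only a rough approximation used here to keep the example short.

```python
def cast_vote(vote_table, label, scoop_id, transform):
    """Add a new vote table entry for label, or update the existing one."""
    entry = vote_table.get(label)
    if entry is None:
        # First hypothesized match with this label: score starts at 1.
        vote_table[label] = {"score": 1,
                             "scoops": [scoop_id],
                             "transform": list(transform)}
        return
    count = entry["score"]
    entry["score"] = count + 1
    entry["scoops"].append(scoop_id)
    # Cumulative average of the transformation parameters seen so far.
    entry["transform"] = [(old * count + new) / (count + 1)
                          for old, new in zip(entry["transform"], transform)]

votes = {}
cast_vote(votes, ("mol-7", 3, 12), 11, [1.0, 2.0])
cast_vote(votes, ("mol-7", 3, 12), 14, [3.0, 4.0])
cast_vote(votes, ("mol-2", 0, 5), 42, [0.5, 0.5])
```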

In step 909, one or more entries of the vote table are selected, and the alignment 
transformation corresponding to the selected entry(ies) is applied to the corresponding stored 
molecule (which is identified by the label associated with the vote table entry) to generate an 
alignment of the stored molecule in the query frame. 

Finally, in step 915, the alignment of the stored molecule in the query frame is reported to 
the user via an I/O device and operation of the recognition phase ends. 

It should be noted that the operations described above with respect to FIG. 9 represent a 
preferred embodiment of the present invention. One skilled in the art will realize that 
operations similar to those described above can be used to construct an hypothesized match 
between a stored scoop and a query scoop based upon any data characterizing a set of axes 
derived from the property distribution of the stored scoop and query scoop, respectively. 

As described above, the training phase is used to define the association criteria between 
query scoops and stored scoops, and keys and the corresponding multi-map data structure capture 
the associations. In an alternate embodiment, any arbitrary selection scheme (for example, one 
based on a distance metric) can be used to associate a query scoop with a stored scoop, which 
leads to construction of a match hypothesis between the query scoop and the associated stored 
scoop. 

The computer processing system 100 that implements the present invention may be 
distributed in nature as shown in FIG. 1(B). More specifically, a distributed computer processing 
system comprises more than one CPU 103 (three shown: 103-1, 103-2, 103-3), with each of these 
CPUs communicating with one another via message passing utility 113. The message passing 
utility 113 may be implemented via shared memory, a network connection, a high speed switch 
or some other method that allows data to be passed from CPU to CPU. A distributed computer 
processing system is preferably used for the recognition phase of the present invention because of 
the inherently parallel nature of the algorithm. More specifically, the multi-map data structure 
generated in the acquisition phase as described above is preferably partitioned amongst the CPUs 
of the distributed system. The multi-map data structure may also be partitioned amongst the 
various nonvolatile storage devices 108 associated with a given CPU 103. For example, the 
multi-map may be partitioned into nine portions MM11, MM12, MM13, MM21, MM22, MM23, 
MM31, MM32, MM33 among three CPUs 103-1, 103-2 and 103-3 and their associated nine 
storage devices 108-11, 108-12, 108-13, 108-21, 108-22, 108-23, 108-31, 108-32, 108-33 as 
shown. In addition, the vote table data structure may be similarly partitioned amongst the CPUs 
of the distributed system. When such a system is used in the recognition phase described above, 
as table entries are retrieved from the multi-map, such table entries are routed via the message 
passing utility 113 to the appropriate CPU for accumulation in the proper segment of the vote 
table. In the end, a distributed merge sort is preferably used to collate all of the resulting 
hypothesized matches on a single CPU. 
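
As a hedged illustration of such partitioning, the Python sketch below assigns each multi-map key to one of nine portions using a checksum of the key. The use of CRC-32, the function name `partition_of`, and the 1-based (CPU, device) numbering that mirrors MM11 through MM33 are assumptions; the disclosure does not state how keys are assigned to portions.

```python
import zlib

def partition_of(key, n_cpus=3, n_devices=3):
    """Map a multi-map key to a 1-based (cpu, device) pair, e.g. (2, 3) for MM23."""
    digest = zlib.crc32(repr(key).encode("utf-8"))
    slot = digest % (n_cpus * n_devices)
    return slot // n_devices + 1, slot % n_devices + 1

# Every key deterministically lands in exactly one of the nine portions.
assignments = {key: partition_of(key) for key in [(2, 7), (0, 1), (5, 5), (9, 3)]}
```

Because the assignment depends only on the key, the CPU that receives a query key is also the CPU holding the matching portion of the multi-map.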

Derivation of Components of Descriptor Vectors for a Scoop 

As described above, the components of the descriptor vector characterizing a given scoop 
preferably include Sx, Sy, Sz, which represent the components of a vector describing translation 
between the origin of the input reference frame and the center of mass in the input reference frame. 
These components may be calculated utilizing well known techniques, including the operations 
described in column 6 of U.S. Patent No. 5,784,294 to Platt et al., incorporated by reference 
above in its entirety. 
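
For a scoop sampled at grid points, the center of mass can be sketched as a mass-weighted mean of the grid point positions. The function name and the sample data below are hypothetical, and this is the textbook formula rather than a transcription of the cited patent.

```python
def center_of_mass(points, masses):
    """Mass-weighted mean position of the grid points of a scoop."""
    total = sum(masses)
    if total <= 0.0:
        raise ValueError("total mass must be positive")
    return [sum(m * p[axis] for p, m in zip(points, masses)) / total
            for axis in range(3)]

# Two hypothetical grid points with mass values 1 and 3.
s_vector = center_of_mass([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 3.0])
```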

In addition, the components of the descriptor vector characterizing a given scoop 
preferably include the principal values of the moment of inertia tensor I for the scoop, which 
are denoted Ix, Iy, Iz. The inertial reference frame for the scoop is preferably characterized by 
an origin at the center of mass in the input reference frame, and a set of three axes (denoted û1, 
û2, û3) in the input reference frame. The axes û1, û2, û3 are unit vectors that point in the 
same direction as corresponding eigenvectors (principal axes) of the diagonalized moment of 
inertia tensor I of the scoop. The principal values of the moment of inertia tensor I of the scoop 
(Ix, Iy, Iz) are represented by the eigenvalues of the diagonalized moment of inertia tensor I of the 
scoop. The eigenvectors (principal axes) and the corresponding eigenvalues of the diagonalized 
moment of inertia tensor I for the scoop may be calculated utilizing the operations described in 
columns 6-8 of U.S. Patent No. 5,784,294 to Platt et al., incorporated by reference above in its 
entirety. 
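
The moment of inertia tensor of a scoop sampled at grid points may be assembled as in the following Python sketch, which uses the standard definition of the tensor about a chosen origin. The function name and the sample points are hypothetical, and the eigen-decomposition that yields the principal axes and principal values is deliberately left out of the sketch.

```python
def inertia_tensor(points, masses, origin):
    """3x3 moment of inertia tensor of point masses about origin."""
    tensor = [[0.0] * 3 for _ in range(3)]
    for p, m in zip(points, masses):
        r = [p[a] - origin[a] for a in range(3)]
        r_squared = sum(c * c for c in r)
        for a in range(3):
            for b in range(3):
                # I_ab accumulates m * (|r|^2 * delta_ab - r_a * r_b).
                delta = r_squared if a == b else 0.0
                tensor[a][b] += m * (delta - r[a] * r[b])
    return tensor

# Two unit masses on the x axis, symmetric about the origin.
tensor = inertia_tensor([(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
                        [1.0, 1.0], (0.0, 0.0, 0.0))
```

In this example the tensor is already diagonal, so its diagonal elements are the principal values and the coordinate axes are the principal axes.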

Importantly, the three axes û1, û2, û3 do not sufficiently denote an internal frame of 
reference for coordinate transformation because the signs of the corresponding eigenvectors are 
ambiguous. The signs must be determined from other information. 

Accordingly, the present invention provides a procedure for sensing the axes û1, û2, 
û3 with a third order moment vector, which is denoted the asymmetric vector, in order to 
denote an internal frame of reference sufficient for coordinate transformation. The asymmetric 
vector has a general form C = (Cx, Cy, Cz). This general form may characterize a third order 
moment of mass about a center of expansion, denoted above as Ci = (Cix, Ciy, Ciz). In the 
alternative, the general form may characterize a third order moment of charge about a center of 
expansion, denoted above as Cq = (Cqx, Cqy, Cqz). 

In the preferred embodiment of the present invention, the components of the asymmetric 
vector C = (Cx, Cy, Cz) are derived as follows: 

\[ \vec{C} = \int \rho(\vec{x}) \, |\vec{x}|^{2} \, \vec{x} \; d^{3}x \]

where x represents a vector between the center of expansion and any arbitrary 
point in the input reference frame, and 

ρ(x) represents the relevant distribution (i.e., represents the 
mass distribution for the asymmetric vector characterizing third order 
moment of mass, or represents the charge distribution for the asymmetric 
vector characterizing third order moment of charge) of the scoop at points 
in the input reference frame corresponding to the vector x. 

In systems where the distribution ρ(x) of the scoop has discrete values over points in the input 
reference frame, the components of the asymmetric vector C = (Cx, Cy, Cz) may be derived as 
follows: 

\[ \vec{C} = \sum_{i} \rho_{i} \, |\vec{x}_{i}|^{2} \, \vec{x}_{i} \]

where xi represents a vector between the center of expansion in the input 
reference frame and a grid point i in the input reference frame, 

ρi is the mass/charge property at the grid point i (when generating the 
asymmetric vector characterizing third order moment of mass, ρi is the 
mass value μi for the grid point i as described above 
with respect to step 503; and when generating the asymmetric vector 
characterizing third order moment of charge, ρi is the charge value χi 
for the grid point i as described above with respect to 
step 503); and 

Σ sums over the grid points of the scoop. 
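
The discrete sum over grid points can be sketched in Python as shown below. The sketch assumes that the third order moment weights each displacement vector by its squared length, i.e. that each grid point contributes its property value times |x|² times x; that weighting, the function name, and the sample data are assumptions of this example.

```python
def asymmetric_vector(points, values, center):
    """Third order moment vector of grid point values about center.

    Assumed form: sum over grid points of value * |x|^2 * x, where x is the
    displacement of the grid point from the center of expansion.
    """
    result = [0.0, 0.0, 0.0]
    for p, value in zip(points, values):
        x = [p[a] - center[a] for a in range(3)]
        x_squared = sum(c * c for c in x)
        for a in range(3):
            result[a] += value * x_squared * x[a]
    return result

# Mass 1 at x = 2 and mass 2 at x = -1: the center of mass is the origin,
# yet the distribution is lopsided, which the third order moment detects.
c_vector = asymmetric_vector([(2.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
                             [1.0, 2.0], (0.0, 0.0, 0.0))
```

A distribution that is symmetric about the center of expansion gives a zero vector, which is why this quantity is useful for resolving the sign ambiguity of the principal axes.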

With respect to the asymmetric vector Ci characterizing third order moment of mass for the 
scoop, preferably the center of expansion is the center of mass of the scoop. With respect to the 
asymmetric vector Cq characterizing third order moment of charge for the scoop, the center of 
expansion may be the center of charge, center of dipole, or center of quadrupole for the scoop. 



As described above, the components of the asymmetric vector C (such as the asymmetric 
vector Ci characterizing third order moment of mass for the scoop, or the asymmetric vector Cq 
characterizing third order moment of charge for the scoop) are used to sense the axes û1, û2, 
û3 in order to denote an internal frame of reference sufficient for coordinate transformation. 
Preferably, the axes û1, û2, û3 are sensed by looping over a counter j ranging from 1 to 2 
with an increment of 1. In each iteration of the loop, the following conditional operation is 
performed: if the dot product of the axis û[j] and the asymmetric vector is less than zero, then 
the sign of û[j] is swapped (i.e., û[j] = -û[j]). In the first iteration of the loop, the counter j 
is 1 and the conditional operation tests whether the dot product of the axis û1 and the 
asymmetric vector is less than zero. If so, the sign of û1 is swapped. Otherwise, the sign of û1 
remains unchanged. In the second iteration of the loop, the counter j is 2 and the conditional 
operation tests whether the dot product of the axis û2 and the asymmetric vector is less than zero. If 
so, the sign of û2 is swapped. Otherwise, the sign of û2 remains unchanged. The loop then 
terminates, and the third axis û3 is constructed as the cross product of û1 and û2. 
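
The sensing loop described above may be sketched in Python as follows. The function names and the sample vectors are hypothetical; the logic follows the text: the first two axes are flipped when their dot product with the asymmetric vector is negative, and the third axis is the cross product of the first two.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def sense_axes(u1, u2, asymmetric):
    """Return sensed axes [u1, u2, u3] forming a right-handed frame."""
    sensed = []
    for axis in (u1, u2):                      # counter j = 1, 2
        if dot(axis, asymmetric) < 0.0:
            axis = [-c for c in axis]          # swap the sign of the axis
        sensed.append(list(axis))
    sensed.append(cross(sensed[0], sensed[1])) # third axis from the first two
    return sensed

# The second eigenvector points against the asymmetric vector and is flipped.
axes = sense_axes([1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [2.0, 3.0, 0.0])
```

Constructing the third axis from the cross product guarantees a right-handed frame regardless of the signs returned by the eigen-solver.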

Importantly, the sensed axes û1, û2, û3 denote a reference frame suitable for 
coordinate transformation. More specifically, this reference frame is characterized by an origin 
at the center of mass in the input reference frame, and the sensed axes (û1, û2, û3 in the input 
reference frame). This reference frame is preferably used to generate components of the 
descriptor vector that characterizes a scoop. 

In addition, the components of the descriptor vector characterizing a given scoop include 
the components of a transformation matrix representing rotation between the principal axes of 
the moment of inertia tensor I for the scoop and the axes of the input reference frame. Preferably, this 
transformation matrix, which for the sake of description is denoted R, is derived from the sensed 
axes û1 (= u1x x̂ + u1y ŷ + u1z ẑ), û2 (= u2x x̂ + u2y ŷ + u2z ẑ), and û3 (= u3x x̂ + u3y ŷ + u3z ẑ) as 
follows: 

\[ R = \begin{pmatrix} u_{1x} & u_{1y} & u_{1z} \\ u_{2x} & u_{2y} & u_{2z} \\ u_{3x} & u_{3y} & u_{3z} \end{pmatrix} \]

Note that this representation has nine terms. Preferably, the transformation matrix R may be 
represented by three angles: Sθ, Sφ, SΩ. Sφ and Sθ are polar angles that describe an axis of 
rotation. SΩ represents an angle of rotation about that axis. Together, the angles represent the 
rotation transformation represented by the transformation matrix R. The rotation matrix R and 
the angles Sθ, Sφ, SΩ can be calculated using the technique described in M. Pique, "Graphics 
Gems," Academic Press, edited by A. Glassner, pgs. 465-467, herein incorporated by reference in 
its entirety. Note that this representation of the rotation transformation using the angles Sθ, Sφ, 
SΩ only has three terms, which is advantageous because it lowers the storage allocation 
requirements for the components of the descriptor vector of the scoop that represent the 
transformation. 

In addition, the components of the descriptor vector characterizing a given scoop 
preferably include the components of a vector describing translation between center of scoop in 
the inertial reference frame and center of mass in the inertial reference frame, which are denoted 
Vx, Vy, Vz. Preferably, the components Vx, Vy, and Vz are derived by generating a vector 
representing translation between center of scoop in the input reference frame and center of mass 
in the input reference frame, and then applying the rotation transformation matrix R described 
above to transform the vector from the input reference frame to the inertial reference frame 
represented by the sensed axes û1, û2, û3. 

In addition, the components of the descriptor vector characterizing a given scoop 
preferably include the components of a vector describing translation between center of mass in 
the inertial reference frame and center of dipole in the inertial reference frame, which are denoted 
dx, dy, dz. Preferably, the components dx, dy, and dz are derived by generating a vector 
representing translation between center of mass in the input reference frame and center of dipole 
in the input reference frame, and then applying the rotation transformation matrix R 
described above to transform the vector from the input reference frame to the inertial reference 
frame represented by the sensed axes û1, û2, û3. The center of dipole in the inertial reference 
frame may be calculated utilizing the operations described in columns 10-11 of U.S. Patent No. 
5,784,294 to Platt et al., incorporated by reference above in its entirety. 
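
The reduction of a nine-term rotation matrix to three angles, described above, may be illustrated with the standard axis-angle formulas. The Python sketch below is not necessarily the technique of the cited reference; the function name is hypothetical, and the sketch does not recover the axis when the rotation angle is 0 or π.

```python
import math

def rotation_to_angles(r):
    """Return (theta, phi, omega) for a proper 3x3 rotation matrix r.

    theta and phi are the polar angles of the rotation axis; omega is the
    angle of rotation about that axis.
    """
    trace = r[0][0] + r[1][1] + r[2][2]
    omega = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    s = 2.0 * math.sin(omega)
    if abs(s) < 1e-12:
        # omega is 0 or pi: the antisymmetric part of r vanishes, so this
        # simple formula cannot recover the axis; the z axis is returned.
        return 0.0, 0.0, omega
    axis = [(r[2][1] - r[1][2]) / s,
            (r[0][2] - r[2][0]) / s,
            (r[1][0] - r[0][1]) / s]
    theta = math.acos(max(-1.0, min(1.0, axis[2])))
    phi = math.atan2(axis[1], axis[0])
    return theta, phi, omega

# A rotation by 90 degrees about the x axis.
theta, phi, omega = rotation_to_angles([[1.0, 0.0, 0.0],
                                        [0.0, 0.0, -1.0],
                                        [0.0, 1.0, 0.0]])
```

Storing three angles instead of nine matrix elements trades a small amount of computation at lookup time for a smaller descriptor vector.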

In addition, the components of the descriptor vector characterizing a given scoop 
preferably include one or more components of a tensor characterizing the quadrupolar moment 
about a center of expansion, which are denoted Qxx, Qxy, Qxz, Qyy, Qyz, Qzz. Preferably, the 
components Qxx, Qxy, Qxz, Qyy, Qyz, Qzz characterize contribution of such quadrupolar moment 
along the sensed axes û1, û2, û3. In this case, components of a tensor Q characterizing 
quadrupolar moment about a center of expansion are generated. The components of the tensor Q 
may be calculated utilizing the operations described in columns 10-11 of U.S. Patent No. 
5,784,294 to Platt et al., incorporated by reference above in its entirety. Finally, the following 
operations are performed with respect to the tensor Q to calculate contribution of the tensor Q 
along the sensed axes û1, û2, û3: 



\[ Q_{xx} = \hat{u}_{1} \cdot Q \cdot \hat{u}_{1} \]

While the invention has been described in connection with specific embodiments, it will 
be understood that those with skill in the art may develop variations of the disclosed 
embodiments without departing from the spirit and scope of the following claims. 

We claim: 

1. In a data processing system wherein descriptor vectors associated with a plurality 
of regions of molecules are stored in a database, a method for generating and storing data 
characterizing at least one region of said plurality of regions, the method comprising the steps 
of: 

generating an entry comprising i) an identifier that identifies said at least one region, and 
ii) data characterizing a set of axes derived from property distribution of said at least one region; 

applying a mapping to the descriptor vector associated with said at least one region; 

generating a key that corresponds to said mapping of the descriptor vector associated with 
said at least one region; and 

storing said entry in a memory, wherein said key is associated with said entry. 

2. The method of claim 1, wherein said set of axes are invariant to rotation and 
translation of said at least one region. 

3. The method of claim 2, wherein said set of axes are derived from principal axes of 
said property distribution. 

4. The method of claim 3, wherein said property distribution of said at least one 
region is based upon application of a smearing function to a property field. 

5. The method of claim 1, wherein said plurality of descriptor vectors are classified 
into groups, wherein said mapping step maps said descriptor vector to a space that optimally 
discriminates between said groups of descriptor vectors. 

6. The method of claim 5, wherein said mapping is derived from the steps of: 



generating first data representing differences between said groups of descriptor vectors; 

generating second data representing variations within said groups of descriptor vectors; 

identifying a set of component vectors that maximizes an F distributed criterion function, 
said criterion function having a numerator based upon said first data and a denominator based 
upon said second data; 

generating an F distributed statistic for subsets of said component vectors, said statistic 
having a numerator based upon said first data and a denominator based upon said second data; 

for each particular subset of component vectors, calculating a probability value for the 
F-distributed statistic associated with the particular subset; 

selecting a probability value from probability values for said subsets of component 
vectors based upon a predetermined criterion; 

identifying the subset of said component vectors associated with the selected probability 
value; and 

generating a mapping to a space corresponding to the subset of component vectors 
associated with the selected probability value, and storing the mapping for subsequent 
processing. 

7. The method of claim 6, wherein said first data comprises a matrix Sb 
representing covariance between said groups of descriptor vectors, and said second data 
comprises a matrix Sw representing covariance within said groups of descriptor vectors. 

8. The method of claim 7, wherein said criterion function has the general form: 

\[ f(w) = C \, \frac{w^{T} S_{b} \, w}{w^{T} S_{w} \, w} \]

where w is some vector, and C is a constant based upon degrees of freedom in Sb and Sw. 

9. The method of claim 8, wherein C is determined as follows: 

\[ C = \frac{1 / (\text{degrees of freedom in } S_{b})}{1 / (\text{degrees of freedom in } S_{w})} = \frac{1 / (N - 1)}{1 / \left( \sum_{j} n_{j} - N \right)} \]

where N represents the number of groups of descriptor vectors, nj represents the number of 
regions, and Σ nj represents the sum of nj for the N groups. 

10. The method of claim 7, wherein the step of identifying a set of component vectors 
that maximizes an F distributed criterion function comprises the substeps of: 

determining a set of (eigenvalue, eigenvector) pairs for the matrix Sw; and 

determining said set of component vectors based upon said set of (eigenvalue, 
eigenvector) pairs for the matrix Sw . 

11. The method of claim 10, wherein said statistic for a given subset of component 
vectors is based upon value of said criterion function for said subset of component vectors. 

12. The method of claim 11, wherein said statistic for a given subset of component 
vectors has the following form: 

where fk represents the value of the criterion function at a 
component vector in the given subset, 

C is a constant, 

Ls represents the number of values in the given subset 
of component vectors, and 

the Σ operation sums over the Ls fk values in the given 
subset of component vectors. 

13. The method of claim 12, wherein said probability value for a particular
F-distributed statistic represents a probability value that the particular F-distributed statistic could 
have been larger by chance. 

14. The method of claim 13, wherein said probability value selected from probability 
values for said subsets of component vectors is a minimum probability value of said probability 
values for said subsets of component vectors. 
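The selection described in claims 13 and 14 can be sketched as follows. The sketch computes the upper-tail probability of an F distribution (the probability that the statistic could have been larger by chance) and picks the subset with the minimum probability. The degrees of freedom assigned to each subset are not stated in this text, so they are passed in as assumed parameters; the function names and the example statistics are hypothetical.

```python
import math


def _beta_cf(a, b, x, max_iter=200, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        even = m * (b - m) * x / ((a + 2 * m - 1.0) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1.0))
        for term in (even, odd):
            d = 1.0 + term * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + term / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_upper_tail(f_value, dof_num, dof_den):
    """P(F > f_value) for an F distribution with (dof_num, dof_den) degrees of freedom."""
    if f_value <= 0.0:
        return 1.0
    x = dof_den / (dof_den + dof_num * f_value)
    return _reg_inc_beta(dof_den / 2.0, dof_num / 2.0, x)


def select_subset(candidates):
    """candidates: (label, statistic, dof_num, dof_den); returns (min probability, label)."""
    return min((f_upper_tail(f, d1, d2), label) for label, f, d1, d2 in candidates)


# Made-up statistics and ASSUMED degrees of freedom for two candidate subsets.
candidates = [("first 1 component vector", 3.0, 2, 4),
              ("first 2 component vectors", 1.0, 2, 4)]
p_value, label = select_subset(candidates)
print(label, p_value)
```

A library routine such as an F-distribution survival function could replace `f_upper_tail`; the standard-library version is used here only to keep the sketch self-contained.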

15. The method of claim 6,

wherein said mapping for said at least one descriptor vector performs a loop over each
component vector belonging to the subset of component vectors associated with the selected
probability;

wherein, in each iteration of said loop, a dot product of said descriptor vector with a
transpose of a unit vector for the given component vector is added to a running sum.
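The loop recited in claim 15 can be sketched as below, reading the claim literally: each selected component vector is normalized to a unit vector, its dot product with the descriptor vector is taken, and the result is accumulated into a single running sum. The function name `map_descriptor` and the example vectors are hypothetical.

```python
import math


def map_descriptor(descriptor, component_vectors):
    """Sum, over the selected component vectors, of descriptor . unit(component)."""
    running_sum = 0.0
    for component in component_vectors:
        norm = math.sqrt(sum(c * c for c in component))
        unit = [c / norm for c in component]          # unit vector for this component
        running_sum += sum(d * u for d, u in zip(descriptor, unit))
    return running_sum


# Made-up example: two component vectors along the coordinate axes.
mapped = map_descriptor([3.0, 4.0], [[2.0, 0.0], [0.0, 5.0]])
print(mapped)
```

Keeping the individual dot products instead of summing them would yield one coordinate per component vector; the sketch follows the wording of the claim, which accumulates a single sum.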

16. In a data processing system wherein descriptor vectors associated with a plurality 
of regions of molecules are stored in a database, CHARACTERIZED IN THAT said data 
processing system includes a memory storing a plurality of entries each comprising i) an 
identifier that identifies at least one region and ii) data characterizing a set of axes derived from 
property distribution of said at least one region, a method for determining alignment of similar 
molecular structure comprising the steps of:

providing a descriptor vector associated with said query molecular region;

mapping said descriptor vector associated with said query molecular region; 

generating a second key that corresponds to said mapping of said descriptor vector 
associated with said query molecular region; and 

retrieving from said memory entries that are associated with a first key that corresponds 
to said second key; and 

for at least one entry retrieved from said memory, 

generating data that represents a match hypothesis associated with said query 
molecular region and at least one region R identified by said at least one entry retrieved from said 
memory, wherein said data is based upon parameters of a transformation that aligns a set of axes 
derived from property distribution of said query molecular region with a set of axes derived from 
property distribution of said at least one region,

determining a score associated with said data, and

storing said data and score as an entry in a vote table. 

17. The method of claim 16, further comprising the steps of:

selecting one or more entries of said vote table based upon said score associated 
with said entries; and 

identifying at least one region that corresponds to the selected entries of said vote 
table as potential matching regions to said query molecular region.
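The retrieval, voting and selection steps of claims 16 and 17 can be sketched as follows. The sketch assumes that each stored entry holds an origin and a 3x3 matrix whose columns are orthonormal axes, that keys correspond when they are equal, and that the score is supplied by the caller; none of these details is fixed by the claims. All names (`align`, `vote`, `best_matches`, the entry fields) and the example data are hypothetical.

```python
import math
from collections import defaultdict


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def align(query_origin, query_axes, stored_origin, stored_axes):
    """Rigid transformation that carries the stored axes onto the query axes."""
    rotation = matmul(query_axes, transpose(stored_axes))
    rotated_origin = matvec(rotation, stored_origin)
    translation = [q - r for q, r in zip(query_origin, rotated_origin)]
    return rotation, translation


def vote(multimap, key, query_origin, query_axes, score_fn):
    """Build one match hypothesis per retrieved entry and record it with its score."""
    vote_table = []
    for entry in multimap.get(key, []):
        rotation, translation = align(query_origin, query_axes,
                                      entry["origin"], entry["axes"])
        hypothesis = {"region": entry["region"],
                      "rotation": rotation, "translation": translation}
        vote_table.append((score_fn(entry), hypothesis))
    return vote_table


def best_matches(vote_table, top_n=1):
    """Select the highest-scoring entries and return the regions they identify."""
    ranked = sorted(vote_table, key=lambda item: item[0], reverse=True)
    return [hypothesis["region"] for _, hypothesis in ranked[:top_n]]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

multimap = defaultdict(list)
multimap[7].append({"region": "mol1/r0", "origin": [0.0, 0.0, 0.0],
                    "axes": identity, "descriptor": [1.0, 2.0]})
multimap[7].append({"region": "mol2/r3", "origin": [1.0, 1.0, 0.0],
                    "axes": identity, "descriptor": [4.0, 6.0]})

query_descriptor = [1.0, 2.5]
# ASSUMED score: negative Euclidean distance between descriptor vectors.
score = lambda entry: -math.dist(query_descriptor, entry["descriptor"])

table = vote(multimap, 7, [1.0, 0.0, 0.0], rot_z90, score)
best = best_matches(table)
print(best)
```

Storing the transformation parameters alongside the score lets the selected hypotheses be applied later to align the stored molecule with the query frame.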

18. The method of claim 16, wherein said set of axes derived from property
distribution of a region are invariant to rotation and translation of said region. 

19. The method of claim 18, wherein said set of axes derived from property
distribution of a region are derived from principal axes of said property distribution. 

20. The method of claim 19, wherein said property distribution of said region is based
upon application of a smearing function to a property field.

21. The method of claim 16, wherein said plurality of descriptor vectors are classified
into groups, and wherein said mapping step maps said descriptor vector to a space that optimally
discriminates between said groups of descriptor vectors.

22. The method of claim 21, wherein said mapping is derived from the steps of:

generating first data representing differences between said groups of descriptor vectors; 

generating second data representing variations within said groups of descriptor vectors; 

identifying a set of component vectors that maximizes an F distributed criterion function, 
said criterion function having a numerator based upon said first data and a denominator based 
upon said second data; 

generating an F distributed statistic for subsets of said component vectors, said statistic 
having a numerator based upon said first data and a denominator based upon said second data; 

for each particular subset of component vectors, calculating a probability value for the 
F-distributed statistic associated with the particular subset;

selecting a probability value from probability values for said subsets of component
vectors based upon a predetermined criterion;

identifying the subset of said component vectors associated with the selected probability
value; and

generating a mapping to a space corresponding to the subset of component vectors 
associated with the selected probability value, and storing the mapping for subsequent 
processing. 

23. The method of claim 22, wherein said first data comprises a matrix $\Sigma_b$
representing covariance between said groups of descriptor vectors, and said second data
comprises a matrix $\Sigma_w$ representing covariance within said groups of descriptor vectors.

24. The method of claim 23, wherein said criterion function has the general form:

where w is some vector, and C is a constant based upon degrees of freedom in $\Sigma_b$
and $\Sigma_w$.

25. The method of claim 24, wherein C is determined as follows:

$$C = \frac{1/(\text{degrees of freedom in } \Sigma_b)}{1/(\text{degrees of freedom in } \Sigma_w)} = \frac{1/(N-1)}{1/\left(\sum_i n_i - N\right)}$$

where $N$ represents the number of groups of descriptor vectors, $n_i$ represents the number of
regions, and $\sum_i n_i$ represents the sum of $n_i$ for the $N$ groups.

26. The method of claim 23, wherein the step of identifying a set of component
vectors that maximizes an F distributed criterion function comprises the substeps of:

determining a set of (eigenvalue, eigenvector) pairs for the matrix $\Sigma_w$; and

determining said set of component vectors based upon said set of (eigenvalue,
eigenvector) pairs for the matrix $\Sigma_w$.

27. The method of claim 26, wherein said statistic for a given subset of component 
vectors is based upon value of said criterion function for said subset of component vectors. 

28. The method of claim 27, wherein said statistic for a given subset of component
vectors has the following form:

where $f_k$ represents the value of the criterion function at a component vector in the given
subset, C is a constant, $L_s$ represents the number of $f_k$ values in the given subset of
component vectors, and the $\sum$ operation sums over the $L_s$ $f_k$ values in the given
subset of component vectors.

28. The method of claim 22, wherein said probability value for a particular
F-distributed statistic represents a probability value that the particular F-distributed statistic could
have been larger by chance.

29. The method of claim 28, wherein said probability value selected from probability
values for said subsets of component vectors is a minimum probability value of said probability
values for said subsets of component vectors.

30. The method of claim 22,

wherein said mapping for said at least one descriptor vector performs a loop over each
component vector belonging to the subset of component vectors associated with the selected
probability;

wherein, in each iteration of said loop, a dot product of said descriptor vector with a
transpose of a unit vector for the given component vector is added to a running sum.

SIMILARITY SEARCHING OF MOLECULES BASED UPON
STATISTICAL ANALYSIS OF DESCRIPTOR VECTORS 
CHARACTERIZING MOLECULAR REGIONS 



Abstract of the Disclosure 

The method of the present invention provides for similarity searching of molecules based
upon statistical analysis of descriptor vectors characterizing molecular regions. In a training
phase, an association criterion is generated by which query regions of a query molecule are
associated with regions of molecules stored in a database. Preferably, the association criterion is
based upon statistical analysis of groups of descriptor vectors that characterize properties of the
regions of the molecules stored in the database. In an acquisition phase, the following steps are
performed for each molecule in a series of molecules. Data that represents the structure of the
given molecule is read from persistent memory and used to define a set of three-dimensional
regions of space in the given molecule. For each region, one or more properties of the given
molecule are mapped to property values for grid points of the region. A multi-map entry is
generated that identifies the region, and the position and orientation of a set of axes derived
from the property values of the grid points of the region. The association criterion generated in
the training phase is used to generate a key for the region, and the entry is stored in the
multi-map at a location associated with the key. In the recognition phase, data that represents
the structure of a query molecule is used to define a set of regions in the query molecule. For
each region, one or more properties of the query molecule are mapped to property values for grid
points of the query region. The association criterion generated in the training phase is used to
generate a key for the query region. The multi-map entry identified by the key is retrieved and
the data stored therein are read from the table. For each stored region identified by the retrieved
table entry, a hypothesized match is constructed and added to a vote table. After processing all
of the stored regions identified by the retrieved multi-map entry for the set of query regions in
the query molecule, one or more entries of the vote table are selected, the alignment
transformations stored in the selected entries are applied to corresponding molecules stored in the
database, and the resultant alignment(s) of the stored molecule in the query frame are reported
to the user device.
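The acquisition phase and the key lookup of the recognition phase described above can be sketched as follows. The sketch assumes that the association criterion is available as a function mapping a descriptor vector to a number, and that a key is obtained by binning that number; the binning rule, the bin width, the stand-in mapping and all names (`make_key`, `acquire`, the entry fields) are hypothetical and are not taken from the disclosure.

```python
import math
from collections import defaultdict


def make_key(mapped_value, bin_width=0.5):
    """ASSUMED keying rule: index of the bin that contains the mapped value."""
    return math.floor(mapped_value / bin_width)


def acquire(regions, mapping, multimap):
    """Store one multi-map entry per region at the location given by its key."""
    for region in regions:
        key = make_key(mapping(region["descriptor"]))
        multimap[key].append({"region": region["id"],
                              "origin": region["origin"],
                              "axes": region["axes"]})
    return multimap


# Hypothetical stand-in for the association criterion produced by the training phase.
mapping = lambda descriptor: sum(descriptor)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
regions = [
    {"id": "mol1/r0", "descriptor": [0.5, 0.7], "origin": [0.0, 0.0, 0.0], "axes": identity},
    {"id": "mol1/r1", "descriptor": [0.6, 0.8], "origin": [2.0, 0.0, 0.0], "axes": identity},
    {"id": "mol2/r0", "descriptor": [1.5, 1.6], "origin": [0.0, 3.0, 0.0], "axes": identity},
]

multimap = acquire(regions, mapping, defaultdict(list))

# Recognition: the same criterion keys the query region, and the entry is retrieved.
query_key = make_key(mapping([0.55, 0.75]))
retrieved = multimap.get(query_key, [])
print(query_key, [entry["region"] for entry in retrieved])
```

Because several regions may share a key, each location of the multi-map holds a list of entries, which is what allows one query region to produce several match hypotheses.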